2  Case Study: The Titanic

flowchart LR
imp[Import]
rec[Recode]
annot[Annotate]
anal[Analysis]
uni[Univariate
Descriptive
Analysis] ds[Data
Snapshot] biv[Descriptive
Association
Analysis] imp --> rec --> annot --> ds --> anal anal --> uni & biv

2.1 Introduction

This chapter provides a fairly complete case study involving data import, recoding, annotation using metadata from an external file, and descriptive analyses. Many of the techniques used here are only briefly described, and more thorough descriptions are given in the chapters that follow. Marginal notes refer to the pertinent later sections. The case study requires a bit more complex coding than we will see later, to deal with British money in 1912. Don’t worry, coding will get easier for you with time after we build up with simpler examples in later chapters.

2.2 Data

The Regression Modeling Strategies text and course notes contains a detailed case study where survival patterns of passengers on the Titanic were analyzed. That analysis was based on titanic3, an older version of the data on 1309 passengers in which a large number of passengers had missing age. David Beltran del Rio used updated data from Encyclopedia Titanica to create a much improved dataset titanic5 with far fewer missing ages. The data are available at hbiostat.org/data/repo/titanic5.xlsx as a 4-sheet Excel file, and the key analysis sheet for passengers is available at hbiostat.org/data/repo/titanic5.csv

2.3 Importing the Data

The .xlsx file passenger sheet’s date of birth field was not handled correctly by either the R readxl package or the openxlsx package, even when reading the field as a plain text field. Somehow Excel intercepted the way the field was stored in the sheet that made it impossible to read it correctly from the binary file. The sheet had to be opened in Excel itself and saved as a .csv file for the dates not to be distorted. The .csv file is imported below. The readxl package is used only to read the metadata (data dictionary) sheet.

Some non-numeric variables used blanks to denote missing. We systematically change those blanks to true NAs. The R trimws function is used to trim white space from the values so we do not need to distinguish '' from ' ' and ' '.

# Load two packages
require(Hmisc)
require(data.table)
getRs('reptools.r')    # Loads reptools.r from Github
hookaddcap()   # make knitr call a function at the end of each chunk
               # to automatically add to list of figures

# readxl does not handle URLs so download the Excel file (for metadata)
file <- tempfile()
download.file('https://hbiostat.org/data/repo/titanic5.xlsx', 
              destfile=file)
t5 <- csv.get('https://hbiostat.org/data/repo/titanic5.csv',
             lowernames=TRUE, allow='_')
# Convert blanks in character fields to NAs
g <- function(x)
  if(is.character(x)) ifelse(trimws(x) == '', NA, x) else x
t5 <- lapply(t5, g)
# Change to a data table
setDT(t5)
# Copy t5 into d so we can save original data t5 for later
# With just d <- t5, d and t5 would occupy the same memory
# and changes to data.table d would cause t5 to change
d <- copy(t5)
# Rename dob_clean to dob
setnames(d, 'dob_clean', 'dob')
head(d)
   name_id                                 name    sex   age class ticket
1:      28        ALLEN, Miss Elisabeth Walton  female 29.00     1  24160
2:      35       ALLISON, Master Hudson Trevor    male  0.92     1 113781
3:      36         ALLISON, Miss Helen Loraine  female  2.00     1 113781
4:      37 ALLISON, Mr Hudson Joshua Creighton    male 30.00     1 113781
5:      38           ALLISON, Mrs Bessie Waldo  female 25.00     1 113781
6:      44                  ANDERSON, Mr Harry    male 47.00     1  19952
        joined    occupation boat..body.       price  job survived
1: Southampton          <NA>           2 £211 60s 9d <NA>        1
2: Southampton          <NA>          11    £151 16s <NA>        1
3: Southampton          <NA>        <NA>    £151 16s <NA>        0
4: Southampton Businessman         [135]    £151 16s <NA>        0
5: Southampton          <NA>        <NA>    £151 16s <NA>        0
6: Southampton Stockbroker             3     £26 11s <NA>        1
                                             url date_death        dob
1: /titanic-survivor/elisabeth-walton-allen.html 12/15/1967  10/1/1882
2:         /titanic-survivor/trevor-allison.html   8/7/1929   5/7/1911
3:          /titanic-victim/loraine-allison.html  4/15/1912   6/5/1909
4:  /titanic-victim/hudson-joshua-creighton.html  4/15/1912  12/9/1881
5:     /titanic-victim/bessie-waldo-allison.html  4/15/1912 11/14/1886
6:         /titanic-survivor/harry-anderson.html 11/23/1951 10/20/1864
   age_f_code age_f sibsp parch
1:          C    29     0     0
2:          C     1     1     2
3:          C     2     1     2
4:          C    30     1     2
5:          C    25     1     2
6:          C    47     0     0
# Read metadata
dd <- readxl::read_xlsx(file, sheet='Titanic5_metadata')
# Keep only the regular part of the dictionary
setDT(dd)
dd <- dd[, .(Variable, Description)]
# Change names to lower case and replace blank and [] with . as
# csv.get did.  makeNames is in Hmisc
dd[, Variable := makeNames(tolower(Variable), allow='_')]
dd[Variable == 'dob_clean', Variable := 'dob']
# Print variable names and descriptions pasted together to save
# horizontal space
# \t is the tab character
# Use writeLines instead of ordinary print to omit row numbers and
# respect \t
dd[, writeLines(paste0(Variable, '\t: ', Description))]
name_id : Unique ID for Passenger / Crew member
name    : Full Name (LAST, Title First)
female  : Female indicator
male    : Male indicator
sex : Sex (text, male/female)
age : Age, numeric (fractional - 10 months = 0.8333)
class.dept  : Text field for passenger class or crew department
class   : Class of Passenger or Department of Crew - 1,2,3,B,D,E,R,V (see key)
ticket  : Ticket number
joined  : Port of embarcation
occupation  : Job / Career
boat..body. : Boat (rescued survivor), Body (identified victim)
price   : Price of ticket, Pounds
job : Second field of career info / company, role, various
survived    : Survived indicator
url : Encylopedia Titanic URL filename (starts with http://www.encyclopedia-titanica.org)
last    : Last Name
title   : Title / Salutation
first   : First Name
dob : Date of birth if available
year_birth  : Year or Year/Month of birth if available
date_death  : Date of death
dob : Date of birth, cleaned, derived when only year or month/year of birth available, supplemented with values from "Age" variable if "DoB" or "Year_Birth" not available
age_f_code  : A code to indicate the method by which
age_f   : Age "Final" - all ages for which any data exists, see "Age_F_Code" for how it was derived
sibsp   : Number of Siblings/Spouses Aboard - obtained from familiar "Titanic3" dataset, merged via "Name_ID" varibale
parch   : Number of Parents/Children Aboard - obtained from familiar "Titanic3" dataset, merged via "Name_ID" varibale
NULL

2.4 Recoding Data

Start by recoding date of birth from m/d/yyyy format to become a standard R date in yyyy-mm-dd format.

d[, dob := as.Date(dob, '%m/%d/%Y')]

Check whether dob is missing if and only if the original integer age is missing. is.na is the R function that returns TRUE/FALSE if a variable’s value is missing (NA) or not.

d[, table(is.na(age), is.na(dob))]
       
        FALSE TRUE
  FALSE  1031  227
  TRUE     49    2

Not quite. dob is missing more often than age. When both are non-missing, check that they are consistent. Stratify by whether/how birth month or day were imputed.

# Compute age in days as of the date of the disaster, then convert to
# fractional years
# as.numeric makes the result a standard numeric variable instead of a
# "difftime" variable
d[, age := as.numeric((as.Date('1912-04-15') - dob) / 365.25)]
ggplot(d, aes(x = age - age_f)) + geom_histogram() +
  facet_wrap(~ age_f_code)

Agreement is reasonable. A spike occurs for those passengers who had to have their birth month and day were imputed to Jun 30. Code C is for complete date of birth, and D corresponds to unknown birth day imputed to 15. For passengers with dob present use our newly computed age, and for others use the original integer age.

d[, age := ifelse(is.na(age), age_f, age)]
# Remove old age variables
d[, .q(age_f, age_f_code) := NULL]     # .q is in Hmisc

The Titanic ticket purchase price needs to undergo some moderately complex recoding to convert from British pounds, shillings, etc. to pounds + fraction of a pound. The prices are in the format £\(x\) \(y\)s \(z\)d for \(x\) pounds, \(y\) shillings (\(\frac{1}{20}\) pound) and \(z\) pennies (\(\frac{1}{12}\) shilling). We need to compute pounds in decimals which is \(x + \frac{y}{20} + \frac{z}{12 \times 20}\). See here for more information.

Start by splitting price with a space as field delimiter. The R strsplit function splits each observation’s string on the specified delimiter and creates a list with length equal to the number of observations. Each element of the list is a vector containing all the sub-fields found during splitting. To keep our R global environment from being cluttered with intermediate variables, and to possibly re-use code later, package the whole conversion in a function called pounds2dec. To avoid repetitive coding within pounds2dec a service function g is written. An argument check in pounds2dec causes various intermediate results to be printed for checking data and logic. The Hmisc package prn function is used to print the code producing intermediate output along with the output. The conversion process allows pounds, shillings, or pennies notation to be out of order within the price character string.

pounds2dec <- function(price, check=FALSE) {
  p <- strsplit(price, ' ')
  if(check) prn(table(sapply(p, length)), 'Distribution of number of fields found')
  # Get 1st, 2nd, 3rd fields (pad with blanks)
  p1 <- sapply(p, function(x) x[1])
  p2 <- sapply(p, function(x) if(length(x) < 2) ' ' else x[2])
  p3 <- sapply(p, function(x) if(length(x) < 3) ' ' else x[3])
  if(check) {
    prn(table(p1), 'Frequencies of occurrence of all 1st field values')
    prn(table(p2), '2nd field')
    prn(table(p3), '3rd field')
    prn(table(substring(p1, 1, 1)),       'First letter of 1st field')
    prn(table(substring(p2, nchar(p2))),  'Last letter of 2nd field')
    prn(table(substring(p3, nchar(p3))),  'Last letter of 3rd field')
  }
  # Put the 3 fields into a matrix for easy addressing
  ps   <- cbind(p1, p2, p3)
  pdec <- rep(0, length(p1))
  # Function to return 0 if a specific currency symbol r is not
  # found in x, otherwise removes the symbol and returns the
  # remaining characters as numeric
  # grepl returns TRUE if r is inside x, FALSE otherwise
  # gsub replaces character r with "nothing" (''), i.e., removes r
  g <- function(x, r) ifelse(grepl(r, x),
                             as.numeric(gsub(r, '', x)), 0)
  for(j in 1 : 3)  {
    pj   <- ps[, j]    # jth sub-field of price for all passengers
    pdec <- pdec + g(pj, '£') + g(pj, 's') / 20 + g(pj, 'd') / (12 * 20)
  }
  # If original price is NA or is either empty or blank
  # make sure result is NA
  pdec[is.na(price) | trimws(price) == ''] <- NA
  if(check) print(data.table(price, pdec)[1:10])
  pdec   # last object named in function is the returned value
}

Apply the function to do the conversion, replacing the price variable in the data table.

d[, price := pounds2dec(price, check=TRUE)]

Distribution of number of fields found   table(sapply(p, length))


  1   2   3 
230 497 582 

Frequencies of occurrence of all 1st field values   table(p1)

p1
 £10 £100 £106 £108  £11 £110 £113  £12 £120  £13 £133 £134 £135 £136  £14 £146 
  40    2    3    3   11    4    3   20    4   79    2    5    3    2   31    3 
 £15 £151 £153  £16 £164  £17  £18  £19  £20  £21 £211  £22 £221  £23  £24 £247 
  42    6    3   17    4    4    9    6   13   22    9    9    4   14   11    8 
 £25  £26 £262 £263  £27  £28  £29   £3  £30  £31  £32  £33  £34  £35  £36  £37 
  11   82    7    6   26    4   14    1   17   22    4    4    7    5    4    3 
 £38  £39   £4  £40  £41  £42  £45  £46  £47  £49   £5  £50  £51 £512  £52  £53 
   1   19    1    1    4    3    1    8    2    3    1    4    6    4   12    6 
 £55  £56  £57  £59   £6  £60  £61  £63  £65  £66  £69   £7  £71  £73  £75  £76 
   8   10    6    4   18    2    6    3    5    2   13  324    4    7    4    5 
 £77  £78  £79   £8  £80  £81  £82  £83  £86  £89   £9  £90  £91  £93    0 
   5    5    8   99    3    3    4    8    3    2   32    3    2    4   11 

2nd field   table(p2)

p2
    10s 11d 11s 12s 13s 14s 15s 16s 17s 18s 19s  1d  1s  2s  3d  3s  4d  4s  5d 
230 155   4  61  10  51  46 123  18 117  52  14   3  81  44   1  24   2  70   1 
 5s 60s  6d  6s  7d  7s  8d  8s  9s 
 65   1   3  14   1  35   1  28  54 

3rd field   table(p3)

p3
    10d 11d 15d 17d  1d  2d  3d  4d  5d  6d  7d  8d  9d 
727  22  88   1   1  44  35  41  21  26 152  78  28  45 

First letter of 1st field   table(substring(p1, 1, 1))


   £    0 
1291   11 

Last letter of 2nd field   table(substring(p2, nchar(p2)))

        d    s 
 230   16 1063 

Last letter of 3rd field   table(substring(p3, nchar(p3)))

      d 
727 582 
          price      pdec
 1: £211 60s 9d 214.03750
 2:    £151 16s 151.80000
 3:    £151 16s 151.80000
 4:    £151 16s 151.80000
 5:    £151 16s 151.80000
 6:     £26 11s  26.55000
 7:  £77 19s 2d  77.95833
 8:   £51 9s 7d  51.47917
 9:  £49 10s 1d  49.50417
10: £247 10s 6d 247.52500

The printed output resulting from check=TRUE verifies our data and logic.

2.5 Annotate Dataset

Next bring the external data dictionary into the main data table. The data dictionary did not define units of measurement so we will later add those manually. We start by just adding variable labels to the variables after checking that all the variables in the data have variable names in the data dictionary. We override the label for age that was constructed from the variables from which it was derived, override the labels for sex, price and dob, and add some units of measurement. The data dictionary is displayed in html format using the Hmisc package contents function.

all(names(d) %in% dd$Variable)    # dd$Variable extracts column "Variable" from dd

[1] TRUE

labs          <- dd$Description
names(labs)   <- dd$Variable
labs['age']   <- 'Age'            # named vector can be addressed with names
labs['price'] <- 'Ticket price'   # instead of just integer subscripts
labs['dob']   <- 'Date of birth'
labs['sex']   <- 'Sex'
d <- upData(d, labels=labs,
            units=c(age='years', price='£'))

Input object size: 490944 bytes; 17 variables 1309 observations New object size: 498216 bytes; 17 variables 1309 observations

html(contents(d))
d Contents

Data frame:d

1309 observations and 17 variables, maximum # NAs:1257  
NameLabelsUnitsClassStorageNAs
name_idUnique ID for Passenger / Crew memberintegerinteger 0
nameFull Name (LAST, Title First)charactercharacter 0
sexSexcharactercharacter 0
ageAgeyearsnumericdouble 2
classClass of Passenger or Department of Crew - 1,2,3,B,D,E,R,V (see key)integerinteger 0
ticketTicket numbercharactercharacter 0
joinedPort of embarcationcharactercharacter 0
occupationJob / Careercharactercharacter 621
boat..body.Boat (rescued survivor), Body (identified victim)charactercharacter 697
priceTicket price£numericdouble 7
jobSecond field of career info / company, role, variouscharactercharacter1257
survivedSurvived indicatorintegerinteger 0
urlEncylopedia Titanic URL filename (starts with http://www.encyclopedia-titanica.org)charactercharacter 0
date_deathDate of deathcharactercharacter 49
dobDate of birthDatedouble 229
sibspNumber of Siblings/Spouses Aboard - obtained from familiar "Titanic3" dataset, merged via "Name_ID" varibaleintegerinteger 0
parchNumber of Parents/Children Aboard - obtained from familiar "Titanic3" dataset, merged via "Name_ID" varibaleintegerinteger 0

The data dictionary can be displayed in the RStudio Viewer window using the reptools repository function htmlView as shown below.

getRs('reptools.r')    # Loads reptools.r from Github
htmlView(contents(d))

2.6 Recoding and Annotation in One Step

Instead of using external metadata and a sequence of recoding steps, take advantage of pounds2dec existing as a self-contained function and use the Hmisc upData function to do everything in one step. Go back to the original imported data table t5. We won’t bother to provide labels to all the variables; just the ones we’ll analyze later, and date of birth.

d <- copy(t5)
d <- upData(d,
  rename = .q(dob_clean = dob),
  dob    = as.Date(dob, '%m/%d/%Y'),
  age    = as.numeric((as.Date('1912-04-15') - dob) / 365.25),
  age    = ifelse(is.na(age), age_f, age),
  price  = pounds2dec(price),
  drop   = .q(age_f, age_f_code),
  labels = .q(age      = Age,
              price    = 'Ticket Price',
              dob      = 'Date of Birth',
              sex      = Sex,
              class    = Class,
              survived = Survived),
  units = .q(age = years, price = '£') )
Input object size:   588360 bytes;   19 variables    1309 observations
Renamed variable     dob_clean  to dob 
Modified variable   dob
Modified variable   age
Modified variable   age
Modified variable   price
Dropped variables   age_f,age_f_code
New object size:    493272 bytes;   17 variables    1309 observations

2.7 Missing Data

The reptools missChk function is used to create a tabbed report summarizing extent and patterns of NAs. The Clustering of missingness tab shows that boat/body bag and job were mainly missing for the same passengers. In the NA combinations tab, hover over the dots to see a great deal of information.

missChk(d)

10 variables have no NAs and 7 variables have NAs

d has 1309 observations (27 complete) and 17 variables (10 complete)

Number of NAs
Minimum Maximum Mean
Per variable 0 1257 168.4
Per observation 0 5 2.2
Frequency distribution of number of NAs per variable
0 2 7 49 229 621 697 1257
10 1 1 1 1 1 1 1
Frequency distribution of number of incomplete variables per observation
0 1 2 3 4 5
27 174 693 359 55 1

Sequential frequency-ordered exclusions due to NAs
job boat..body. dob date_death
1257 19 5 1

2.8 Data Overview

The reptools dataOverview function provides a statistical snapshot of the variables in the data table.

dataOverview(d)

d has 1309 observations (27 complete) and 17 variables (10 complete)

Variable Type Distinct Info Symmetry NAs Rarest Value Frequency of Rarest Value Mode Frequency of Mode
name_id Continuous 1309 1.000 1.007 0 1 1 1 1
name Nonnumeric 1308 1.000 1.000 0 ABBING, Mr Anthony 1 KELLY, Mr James 2
sex Discrete 2 0.688 0.856 0 female 466 male 843
age Continuous 946 1.000 1.167 2 0.2 1 23 15
class Discrete 3 0.817 0.861 0 2 276 3 709
ticket Nonnumeric 926 1.000 1.000 0 10482 1 2343 11
joined Discrete 4 0.661 0.780 0 Belfast 10 Southampton 904
occupation Nonnumeric 153 0.987 0.993 621 Advertising Consultant 1 General Labourer 160
boat..body. Nonnumeric 149 0.998 0.991 697 [1] 1 C 41
price Continuous 282 1.000 6.726 7 3.171 1 8.05 60
job Discrete 3 0.430 0.673 1257 H&W Guarantee Group 9 Servant 43
survived Discrete 2 0.708 0.882 0 1 500 0 809
url Nonnumeric 1309 1.000 1.000 0 /titanic-survivor/abelson.html 1 /titanic-survivor/abelson.html 1
date_death Nonnumeric 452 0.735 0.997 49 1/1/1954 1 4/15/1912 809
dob Continuous 901 1.000 0.885 229 1837-05-17 1 1889-06-30 13
sibsp Discrete 7 0.670 0.818 0 5 6 0 891
parch Discrete 8 0.549 0.839 0 6 2 0 1002
0 [ 2, 49) 49 [ 229, 697) 697 1257
Intervals of frequencies of NAs used for color-coding plots

2.9 Basic Descriptive Statistics

2.10 Univariate and Bivariate Descriptions

The first line of defense is the Hmisc describe function, where the result is rendered as html. This allows inclusion of micro-graphics which are “spike histograms” providing a high-resolution display that can reveal data errors, bimodality, digit preference, and other phenomena.

des <- describe(d)
html(des)
d Descriptives
d

17 Variables   1309 Observations

name_id: Name_ID
image
nmissingdistinctInfoMeanGmd.05.10.25.50.75.90.95
13090130911092730.3 87.4 222.8 547.01095.01627.01963.22086.6
lowest : 1 3 4 5 6 , highest: 2202 2203 2204 2207 2208
name
nmissingdistinct
130901308
lowest :ABBING, Mr Anthony ABBOTT, Mr Eugene Joseph ABBOTT, Mr Rossmore Edward ABBOTT, Mrs Rhoda Mary 'Rosa' ABELSETH, Miss Karen Marie
highest:YVOIS, Miss Henriette ZAKARIAN, Mr Mapriededer ZAKARIAN, Mr Ortin ZENNI, Mr Philip ZIMMERMANN, Mr Leo

sex: Sex
nmissingdistinct
130902
 Value      female   male
 Frequency     466    843
 Proportion  0.356  0.644
 

age: Age years
image
nmissingdistinctInfoMeanGmd.05.10.25.50.75.90.95
13072945129.8915.49 6.5315.7721.0027.8538.7648.8156.02
lowest : 0.1998631 0.4134155 0.4353183 0.6023272 0.6297057
highest:70.151950771.493497671.748117771.789185574.9103354

class: Class
image
nmissingdistinctInfoMeanGmd
1309030.8172.2940.8697
 Value          1     2     3
 Frequency    324   276   709
 Proportion 0.248 0.211 0.542
 

ticket
nmissingdistinct
13090926
lowest :10482 110152 110413 110465 110469
highest:CA2673 PC 14474 S.O./C. 3 S.O./P.P 3SW/PP 751

joined
image
nmissingdistinct
130904
 Value          Belfast   Cherbourg  Queenstown Southampton
 Frequency           10         272         123         904
 Proportion       0.008       0.208       0.094       0.691
 

occupation
nmissingdistinct
688621152
lowest :Advertising Consultant Apprentice Electrician Architect Artist Aviator
highest:Upholsterer Valet Waiter Wood Carver Writer

boat..body.
nmissingdistinct
612697148
lowest : [1] [101] [103] [108] [109] , highest: A[331] A[332] B C D
price: Ticket Price £
image
        n  missing distinct     Info     Mean      Gmd      .05      .10      .25 
     1302        7      281        1    33.47    38.75    7.225    7.650    7.896 
      .50      .75      .90      .95 
   14.454   31.275   77.958  132.968 
 
lowest : 0.000000 3.170833 4.012500 5.000000 6.237500
highest:247.520833247.525000262.375000263.000000512.329167

job
nmissingdistinct
5212572
 Value      H&W Guarantee Group             Servant
 Frequency                    9                  43
 Proportion               0.173               0.827
 

survived: Survived
nmissingdistinctInfoSumMeanGmd
1309020.7085000.3820.4725

url
nmissingdistinct
130901309
lowest :/titanic-survivor/abelson.html /titanic-survivor/abraham-hyman.html /titanic-survivor/abraham-salomon.html /titanic-survivor/ada-ball.html /titanic-survivor/ada-doling.html
highest:/titanic-victim/yoto-danoff.html /titanic-victim/yousif-ahmed-wazli.html /titanic-victim/youssef-joseph-samaan.html /titanic-victim/yousseff-ibrahim-shawah.html/titanic-victim/zahie-maria-khalil.html

date_death
nmissingdistinct
126049451
lowest :1/1/1954 1/1/1975 1/10/19801/11/19461/11/1964
highest:9/4/1949 9/6/1975 9/7/1928 9/7/1965 9/9/1943

dob: Date of Birth
image
          n    missing   distinct       Info       Mean        Gmd        .05 
       1080        229        900          1 1881-12-22       5960 1854-06-07 
        .10        .25        .50        .75        .90        .95 
 1862-06-30 1872-09-23 1883-11-05 1891-06-30 1898-10-25 1907-06-30 
 
lowest :1837-05-171840-06-301840-07-151840-10-161842-02-18
highest:1911-08-291911-09-081911-11-081911-11-161912-02-02

sibsp
image
nmissingdistinctInfoMeanGmd
1309070.670.49890.777
lowest : 0 1 2 3 4 , highest: 2 3 4 5 8
 Value          0     1     2     3     4     5     8
 Frequency    891   319    42    20    22     6     9
 Proportion 0.681 0.244 0.032 0.015 0.017 0.005 0.007
 

parch
image
nmissingdistinctInfoMeanGmd
1309080.5490.3850.6375
lowest : 0 1 2 3 4 , highest: 3 4 5 6 9
 Value          0     1     2     3     4     5     6     9
 Frequency   1002   170   113     8     6     6     2     2
 Proportion 0.765 0.130 0.086 0.006 0.005 0.005 0.002 0.002
 

The spike histogram for name_id is uneven because of rounding the IDs, which are not consecutive integers.

describe also has a plot method. For continuous variables the new information it adds to the small spike histograms is color coding for the number of missing values, and a detailed overall statistical summary. For categorical variables frequencies are displayed using a dot chart. The plots are rendered as interactive graphics using plotly. Hover over points to see more details.

options(grType='plotly')   # makes plot() use plotly
p <- plot(des)   # stores two plot objects in p
p$Categorical    # picks off the first one

p$Continuous

2.10.1 Bivariate Descriptions

Let’s describe two-way relationships between some of the passenger characteristics. In particular we’d like to know the age and ticket price distribution by sex and passenger class. But start with a simple dot chart showing the proportion of males by passenger class. Plots are interactive, rendered by plotly.

plot(summaryM(sex ~ class, data=d))

Now draw extended box plots. Hovering over elements of the boxes reveals their meaning.

# Note the use of variable labels and units
plot(summaryM(age + price ~ sex, data=d))

plot(summaryM(age + price ~ class, data=d))

First-class passengers are older. They have accumulated more money to be able to purchase the high-priced tickets. There is substantial variation in ticket prices only for first-class passengers, and a few first-class passengers paid less than a few third-class passengers.

Make a scatterplot of age vs. ticket price, and add a nonparametric smoother to the plot. The ggplot2 package is used. class is treated as a categorical variable using factor(class). price is plotted on a square root scale to make lower ranges more visible. Labels and units are extracted from the data table so that we can use them in the plot.

# ggplot2 complained about class having a label, so we create a new
# data.table where it doesn't
w <- copy(d)
w[, class := as.integer(class)]
# Extract a named vector of var. labels specially formatting units
vlabs <- sapply(d, function(x) label(x, plot=TRUE, html=FALSE))
# vlab looks up in vlabs without requiring quotes
vlab  <- function(x) vlabs[[as.character(substitute(x))]]
ggplot(w, aes(x=age, y=price, color=factor(class))) +
  geom_point() + geom_smooth() +
  scale_y_continuous(trans='sqrt') +
  guides(color=guide_legend(title='Class')) +
  xlab(vlab(age)) + ylab(vlab(price))

2.11 Associations of Passenger Characteristics With Survival

In the Titanic disaster, survival is synonymous with getting chosen for a lifeboat. Detailed descriptive and formal analysis of such associations appeared in RMS using the older titanic3 dataset, which had a good many missing ages, and the ticket price was not converted to analyzable form so it was ignored. Here we update the descriptive analyses using titanic5. One of the key new questions relates to passenger class and ticket price. It was demonstrated with titanic3 that the chance of surviving declined steadily with increasing passenger class. Third class passengers were situated in the ship’s hold, making the trip to the deck to attempt to get on a lifeboat more difficult. Let’s see if the ticket price contains further information about survival tendencies once we control for passenger class.

Start by estimating the probability of survival as a function of inherently discrete variables sex and passenger class. Separately we compute proportions surviving by sex, then by class, then by both, using basic data.table package operations. Then the data.table cube function is used to do all of that at once.

# Create a function that drops NAs when computing the mean
# Note that the mean of a 0/1 variable is the proportion of 1s
mn <- function(x) mean(x, na.rm=TRUE)
# Create a function that counts the number of non-NA values
Nna <- function(x) sum(! is.na(x))
# This is for generality; there are no NAs in these examples
d[, .(Proportion=mn(survived), N=Nna(survived)), by=sex]    # .N= # obs in by group
      sex Proportion   N
1: female  0.7274678 466
2:   male  0.1909846 843
d[, .(Proportion=mn(survived), N=Nna(survived)), by=class]
   class Proportion   N
1:     1  0.6203704 324
2:     2  0.4275362 276
3:     3  0.2552891 709
d[, .(Proportion=mn(survived), N=Nna(survived)), by=.(sex,class)]
      sex class Proportion   N
1: female     1  0.9652778 144
2:   male     1  0.3444444 180
3:   male     2  0.1411765 170
4: female     2  0.8867925 106
5:   male     3  0.1521298 493
6: female     3  0.4907407 216
cube(d, .(Proportion=mn(survived), N=Nna(survived)), by=.q(sex, class), id=TRUE)
    grouping    sex class Proportion    N
 1:        0 female     1  0.9652778  144
 2:        0   male     1  0.3444444  180
 3:        0   male     2  0.1411765  170
 4:        0 female     2  0.8867925  106
 5:        0   male     3  0.1521298  493
 6:        0 female     3  0.4907407  216
 7:        1 female    NA  0.7274678  466
 8:        1   male    NA  0.1909846  843
 9:        2   <NA>     1  0.6203704  324
10:        2   <NA>     2  0.4275362  276
11:        2   <NA>     3  0.2552891  709
12:        3   <NA>    NA  0.3819710 1309

This can be done more cleanly if we create a function that computes both the proportion surviving and the number of non-NAs in survival status.

g <- function(x) list(Proportion=mean(x, na.rm=TRUE), N=sum(! is.na(x)))
d[, g(survived), by=sex]
      sex Proportion   N
1: female  0.7274678 466
2:   male  0.1909846 843
d[, g(survived), by=class]
   class Proportion   N
1:     1  0.6203704 324
2:     2  0.4275362 276
3:     3  0.2552891 709
d[, g(survived), by=.(sex, class)]
      sex class Proportion   N
1: female     1  0.9652778 144
2:   male     1  0.3444444 180
3:   male     2  0.1411765 170
4: female     2  0.8867925 106
5:   male     3  0.1521298 493
6: female     3  0.4907407 216
cube(d, g(survived), by=.q(sex, class))
       sex class Proportion    N
 1: female     1  0.9652778  144
 2:   male     1  0.3444444  180
 3:   male     2  0.1411765  170
 4: female     2  0.8867925  106
 5:   male     3  0.1521298  493
 6: female     3  0.4907407  216
 7: female    NA  0.7274678  466
 8:   male    NA  0.1909846  843
 9:   <NA>     1  0.6203704  324
10:   <NA>     2  0.4275362  276
11:   <NA>     3  0.2552891  709
12:   <NA>    NA  0.3819710 1309

The reptools repository’s movStats function can also compute summaries for discrete predictors as can a host of other functions.

getRs('movStats.r')
movStats(survived ~ sex,   discrete=TRUE, data=d)
      sex Proportion   N
1: female  0.7274678 466
2:   male  0.1909846 843
movStats(survived ~ class, discrete=TRUE, data=d)
   class Proportion   N
1:     1  0.6203704 324
2:     2  0.4275362 276
3:     3  0.2552891 709
movStats(survived ~ sex + class, discrete=TRUE, data=d)
      sex Proportion   N class
1: female  0.9652778 144     1
2:   male  0.3444444 180     1
3: female  0.8867925 106     2
4:   male  0.1411765 170     2
5: female  0.4907407 216     3
6:   male  0.1521298 493     3

Now we use movStats for what it is mainly intended to do: estimate the relationship between a continuous X and a response Y, without assuming a functional form. movStats does that by using overlapping moving windows, which is a quite general approach that can estimate means (which are proportions when Y=0/1), medians and other quantiles, measures of variability, censored survival estimates, and any other quantities for which the user provides a basic estimation function. movStats can also estimate smooth relationships using parametric, semiparametric, and non-parametric regression directly on the raw data, for the case where means or proportions are involved. While doing so we also stratify on discrete passenger characteristics. Start by only computing only moving proportions, which are estimates of the probability of survival.

# base=9.5 uses more smoothing than the default
# pr='margin' makes movStats put information about moving windows
# in the right margin, using Quarto markup
z <- movStats(survived ~ price + class, bass=9.5, data=d, pr='margin')
Window Sample Sizes
N Mean Min Max xinc
322 30.9 25 31 1
276 30.8 25 31 1
704 30.9 25 31 3
ggplot(z, aes(x=price, y=`Moving Proportion`, col=factor(class))) +
  geom_line() + guides(color=guide_legend(title='Class')) +
  xlab(vlab(price)) + ylab('Survival')

It appears that the chance of getting on a lifeboat increased with increasing ticket price, for 1st and 2nd class.

In the next example we stratify by two variables, and in addition to the moving statistic we compute loess nonparametric estimates, and also use the gold standard smoother for Y=0/1: binary logistic regression (LR) without assuming linearity in the continuous predictor (age). A restricted cubic spline with 5 default knot locations is used to model the age effect, and the LR fits are run separately by the six age-class combinations. melt=TRUE tells movStats to “melt” the data table so that multiple types of estimates become multiple observations. This makes it easy to handle in ggplot2.

z <- movStats(survived ~ age + sex + class, bass=9.5,
              lrm=TRUE, loess=TRUE,
              data=d, melt=TRUE, pr='margin')
Window Sample Sizes
N Mean Min Max xinc
female::1 144 30.7 25 31 1
male::1 180 30.7 25 31 1
female::2 106 30.5 25 31 1
male::2 170 30.7 25 31 1
female::3 216 30.8 25 31 1
male::3 491 30.9 25 31 2
ggplot(z, aes(x=age, y=survived, col=class, linetype=Type)) +
  geom_line() + facet_wrap(~ sex) +
  xlab(vlab(age)) + ylab('Survival')

Repeat the last plot using a nonparametric smoother built-in to ggplot2, which also computes approximate 0.95 confidence bands.

ggplot(d, aes(x=age, y=survived, col=factor(class))) + geom_smooth() +
  facet_wrap(~ sex) +
  ylim(0, 1) + xlab(vlab(age)) + ylab('Survival') +
  guides(color=guide_legend(title='Class'))