titanic5 Dataset
Created by David Beltran del Rio, March 2016.
This is the final (for now) version of my update to the Titanic data. I think it’s finally ready for publishing if you’d like. What I did was to strip all the passenger and crew data from the Encyclopedia Titanica (ET) web pages (excluding channel crossing passengers), create a unique ID for each passenger and crew member (Name_ID), then (painstakingly and hopefully 100% correctly) match to your earlier titanic3 dataset, in order to compare the two and to get your sibsp and parch variables. Since the ET is updated occasionally, the work put into the ID and matching can be reused and refined later. I did eventually hear back from the ET people; they are willing to make the underlying database available in the future, but I have not yet taken them up on it.
The two datasets line up nicely. Most of the differences in the newer titanic5 dataset are in the age variable, as I had mentioned before: the new set has fewer missing ages, with 51 missing (vs 263) out of 1309.
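As a quick check of what those counts mean in relative terms, the arithmetic can be sketched in Python. Only the three counts come from the text above; the helper name pct_missing is made up for illustration.

```python
# Counts quoted in the text above; nothing else here comes from the dataset.
TOTAL_PASSENGERS = 1309
MISSING_TITANIC3 = 263
MISSING_TITANIC5 = 51


def pct_missing(n_missing, n_total):
    """Return the percentage of records with a missing age, rounded to 0.1."""
    return round(100.0 * n_missing / n_total, 1)


pct3 = pct_missing(MISSING_TITANIC3, TOTAL_PASSENGERS)
pct5 = pct_missing(MISSING_TITANIC5, TOTAL_PASSENGERS)
# Net reduction in the number of passengers with no age recorded.
net_reduction = MISSING_TITANIC3 - MISSING_TITANIC5

print(f"titanic3: {pct3}% missing, titanic5: {pct5}% missing")
print(f"net reduction in missing ages: {net_reduction}")
```

So the share of passengers with no age drops from roughly one in five to about one in twenty-five.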
I am in the process of refining my analysis of the data as well, based on your comments below and your Regression Modeling Strategies example.
titanic3_wID data can be matched to titanic5 using the Name_ID variable. Tab titanic5 Metadata has the variable descriptions and allowable values for Class and Class/Dept.
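A minimal sketch of that matching step, in plain Python, might look like the following. It assumes each tab has been exported to rows keyed by column name. Name_ID, sibsp, parch and Age_F are names from this note, but the ID values, the ages and the helper left_join_on_name_id are made up for illustration.

```python
# Toy rows standing in for the two tabs. The column names come from the note;
# the ID values and the numbers are invented for this example.
titanic5_rows = [
    {"Name_ID": "ID_0001", "Age_F": 29.0},
    {"Name_ID": "ID_0002", "Age_F": 0.92},
    {"Name_ID": "ID_0003", "Age_F": 41.0},  # no counterpart in titanic3 below
]
titanic3_wid_rows = [
    {"Name_ID": "ID_0001", "sibsp": 0, "parch": 0},
    {"Name_ID": "ID_0002", "sibsp": 1, "parch": 2},
]


def left_join_on_name_id(left_rows, right_rows, key="Name_ID"):
    """Left-join two lists of dicts on `key`; unmatched rows get None."""
    right_by_id = {}
    for row in right_rows:
        if row[key] in right_by_id:
            # A repeated ID would make the match ambiguous, so fail loudly.
            raise ValueError(f"duplicate {key}: {row[key]}")
        right_by_id[row[key]] = row
    right_cols = [c for c in (right_rows[0] if right_rows else {}) if c != key]

    merged = []
    for row in left_rows:
        match = right_by_id.get(row[key])
        extra = {c: (match[c] if match else None) for c in right_cols}
        merged.append({**row, **extra})
    return merged


merged = left_join_on_name_id(titanic5_rows, titanic3_wid_rows)
# Rows that found no titanic3 match are worth reviewing by hand.
unmatched = [r["Name_ID"] for r in merged if r["sibsp"] is None]
print(unmatched)
```

Keeping unmatched rows (a left join) rather than dropping them makes it easy to list the records that still need a manual check.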
A note about the ages - instead of using the add 0.5 trick to indicate estimated birth day / date, I have a flag that indicates how the “final” age (Age_F) was arrived at. It’s the Age_F_Code variable - the allowable values are in the Titanic5_metadata tab in the attached Excel file. The reason for this is that I already had some fractional ages for infants where I had age in months instead of years, and I wanted to avoid confusion for 6-month-old infants, although I don’t think there are any in the data! Also, I was thinking of making fractional ages or age in days for all passengers for whom I have DoB, but I have not yet done so.
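The ambiguity can be illustrated with a short Python sketch. It assumes the 0.5 convention stores an estimated age as a whole number plus 0.5. The code values "exact" and "estimated" are placeholders, not the real allowable values (those are listed in the metadata tab), and both function names are made up.

```python
def estimated_by_half_trick(age):
    """Old convention (assumed): an age ending in .5 marks an estimate."""
    return age % 1 == 0.5


def estimated_by_flag(record):
    """Flag convention: read the explicit code instead of decoding the age."""
    return record["Age_F_Code"] == "estimated"  # placeholder code value


# Made-up records; the Age_F_Code values are hypothetical labels.
records = [
    {"Age_F": 0.5, "Age_F_Code": "exact"},       # infant known to be 6 months old
    {"Age_F": 30.5, "Age_F_Code": "estimated"},  # how the old trick would store an estimate
    {"Age_F": 30.0, "Age_F_Code": "exact"},
]

for rec in records:
    rec["trick_says_estimated"] = estimated_by_half_trick(rec["Age_F"])
    rec["flag_says_estimated"] = estimated_by_flag(rec)

# Ages for which the two conventions give different answers.
disagreements = [r["Age_F"] for r in records
                 if r["trick_says_estimated"] != r["flag_says_estimated"]]
print(disagreements)
```

In this toy data the two conventions disagree only for the 6-month-old infant, which is the case a separate flag is meant to protect against.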
Here’s what the tabs are:
Titanic5_all - all (mostly cleaned) Titanic passenger and crew records
Titanic5_work - working dataset, crew removed, unnecessary variables removed - this is the one I import into SAS / R to work on
Titanic5_metadata - Variable descriptions and allowable values
titanic3_wID - Original Titanic3 dataset with Name_ID added for merging to Titanic5
I have a csv, R dataset, and SAS dataset, but the variable names are an older version, so I won’t send those along for now to avoid confusion.
If it helps, send my contact info along to your student in case any questions arise. Gmail address is probably best, on weekends for sure: davebdr@gmail.com
The tabs in titanic5.xls are
Titanic5_all
Titanic5_passenger (the one to be used for analysis)
Titanic5_metadata (used during analysis file creation)
Titanic3_wID