Datasets
1 Vanderbilt Biostatistics Datasets
Most of the datasets on this page are in the S dumpdata and R compressed save() file formats. Some are also available in Excel, ASCII (.csv), and Stata (.dta) formats. If you need one of the datasets we maintain converted to a non-S format, please e-mail Frank Harrell to make a request.
If you install the R Hmisc package you can retrieve most of the datasets stored here using, for example, getHdata(titanic3).
Permission is granted to anyone wishing to use the data sets provided here. Please reference the original paper which, for most data sets, is given in our notes linked below, and note “Data obtained from http://hbiostat.org/data courtesy of the Vanderbilt University Department of Biostatistics.”
Note: To make csv files from R save files do the following:
load(url('https://hbiostat.org/data/repo/foo.sav'))
ls() # find name of data frame just loaded (here assumed 'foo')
write.table(foo, file='foo.csv', sep=',', col.names=NA)
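The conversion recipe above relies on knowing the name of the object stored in the save file. As a minimal sketch (not the site's official method), the variant below loads the file into a private environment and uses the value returned by load() to find that name. So that it runs without a network connection, it first creates a small stand-in file; the data frame foo, its columns, and the temporary paths are all hypothetical.

```r
# Build a small stand-in for a downloaded save() file (hypothetical data).
sav_path <- file.path(tempdir(), "foo.sav")
foo <- data.frame(id = 1:3, age = c(34, 51, 27))
save(foo, file = sav_path, compress = TRUE)
rm(foo)

# Load into a private environment so the object name can be discovered
# without overwriting anything in the workspace.
e <- new.env()
loaded <- load(sav_path, envir = e)   # character vector of object names
d <- get(loaded[1], envir = e)

# Write the CSV; row.names = FALSE omits the row-name column.
csv_path <- file.path(tempdir(), paste0(loaded[1], ".csv"))
write.csv(d, file = csv_path, row.names = FALSE)

# Read the file back to confirm the round trip.
back <- read.csv(csv_path)
```

For a dataset hosted on the site, the first argument to load() would instead be a url() connection such as the one shown in the recipe above.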
2 Other Datasets Available from the Web
- Google dataset search
- Data Sources on the Web
- Open Data Repositories
- Data science teaching datasets
- CRASH datasets (for CRASH-2 dataset see above)
- US military adult anthropometric data
- DryadLab
- International Stroke Trial dataset
- Physionet ICU data
- Pooled Resource Open-Access ALS Clinical Trials Database - contains high-quality data with time to event and ordinal scale outcomes. The data may be useful for assessing differential treatment effect (often called HTE - heterogeneous treatment effect) for Riluzole. The database was created by a non-profit organization, Prize4Life
- Clinical Study Data Request - a wealth of data from clinical trials done by the pharmaceutical industry
- GapMinder
- Data and Story Library from Carnegie Mellon University - This is a treasure trove of datasets. The data are found inside HTML documents, so you may wish to click on File & Save as with your browser to save the data into a plain text file. Once inside an editor, click on the data documentation and copy it to another file. Edit the resulting .txt file to leave only the data. Many of the datasets delimit the columns of data using tabs. R and S-Plus will readily import such data.
- Australasian Data and Story Library, containing a large number of interesting datasets, many pertaining to Australia
- Other datasets from the StatLib Repository at Carnegie Mellon University. The Plasma_Retinol dataset is available as an annotated R save file or an S-Plus transport format dataset using the getHdata function in the Hmisc package
- Datasets from the UCI Machine Learning Repository
- Datasets from the Dartmouth Chance data site
- Datasets from the University of Massachusetts Amherst
- Data from the Centers for Disease Control
- Data from the NIH. You have to request the data, but the site is immediately valuable as a source of data collection forms used in clinical (especially cardiovascular) studies.
- Data from NHLBI
- Data from the Geospatial and Statistical Data Center of the University of Virginia. See especially the City and County Data Books: http://fisher.lib.virginia.edu/ccdb
- Data from the Peace Science Society, Penn State University (war casualties, etc.)
- Data from the U.S. Joint Global Ocean Flux Study
- Mike Dowling’s Interactive Table of World Nations, containing gross domestic product and several descriptor variables for all countries in the world.
- Data from the Consortium for International Earth Science Information Network Dataset Guide
- Data from Arizona Elementary School Districts
- Statistical Society of Canada’s archived case studies
- Datasets for research use from the National Heart, Lung, and Blood Institute of the U.S. National Institutes of Health
- A wonderful set of links to various dataset sources from Key Curriculum Press
- Links to other dataset repositories and tips on surfing the web for data, by Robin Lock, Mathematics Dept., St. Lawrence University
- Datasets from Exploring Data from Education Queensland
- Data from Statistical Science Web
- Datasets from Statistical Methods for the Analysis of Repeated Measurements by Charles S. Davis
- Datasets from the UCLA Department of Statistics
- Bradstreet Datasets from Early (and Late) Phases of Drug Research by Thomas E Bradstreet
- Datasets from Interactive and Dynamic Graphics for Data Analysis by Swayne, Cook, Buja, Hofmann, Lang
- Datasets from IBM’s Many Eyes visualization project
- Swivel
- Datasets from The Statistical Sleuth
- Name Voyager
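Several entries in the list above, notably the Data and Story Library, describe saving data to a plain text file whose columns are separated by tabs. A minimal sketch of importing such a file into R follows; the file name, column names, and values are hypothetical stand-ins created on the spot so the example is self-contained.

```r
# A tiny tab-delimited file standing in for data saved from a web page
# (hypothetical values; real files will differ).
txt_path <- file.path(tempdir(), "dasl_example.txt")
writeLines(c("name\tweight\theight",
             "A\t61.5\t170",
             "B\t72.0\t181",
             "C\t\t165"),        # blank numeric field is read as NA
           txt_path)

# read.delim() defaults to sep = "\t" and header = TRUE.
dasl <- read.delim(txt_path, stringsAsFactors = FALSE)
str(dasl)
```

If documentation text remains above the data in the saved file, the skip argument of read.delim() can be used to ignore a known number of leading lines.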
3 Longitudinal Datasets
- Personality and Subjective Age: Evidence from Six Samples (Replication Package): https://hrsdata.isr.umich.edu/sites/default/files/documentation/other/1641492433/HRS_Replication__Package_Stephan_et_al.pdf
- National Longitudinal Survey of Youth 1997: https://dasil.sites.grinnell.edu/downloadable-data/
- National Longitudinal Survey of Youth (1997-2012): a longitudinal project that follows a sample of American youth born between 1980 and 1984 on various aspects of life from 1997 to 2012. Download: CSV (41.0MB)
- Cebu Longitudinal Health and Nutrition Survey: https://dataverse.unc.edu/dataverse/cebu; Cohort Profile: The Cebu Longitudinal Health and Nutrition Survey
- The Wisconsin Longitudinal Study (WLS): https://www.ssc.wisc.edu/wlsresearch/
- National Longitudinal Study of Adolescent to Adult Health, 1994-2008: https://heardlibrary.github.io/digital-scholarship/script/r/nlsaah/
- The Add Health Study: Design
- Search studies associated with available BioLINCC resources: https://biolincc.nhlbi.nih.gov/studies/?q=longitudinal
- Longitudinal Studies of HIV-Associated Lung Infections and Complications (Lung HIV): https://biolincc.nhlbi.nih.gov/studies/lung_hiv/
- UK Data Service: https://beta.ukdataservice.ac.uk/datacatalogue/studies/?Search=Longitudinal#!?
- UA Little Rock, Publicly Available Data Sets: https://ualr.edu/irb/files/2019/07/Public-Use-Data-Sets.pdf
- NCHS Longitudinal Studies of Aging: https://www.cdc.gov/nchs/data_access/ftp_data.htm
Thanks to Drew Levy for compiling this list of longitudinal studies.