flowchart LR
  Im[Import] --> vn[Manageable<br>Variable<br>Names] --> An[Annotate<br>Add Metadata]
  An --> Int["Internal<br>(imported labels)"] & Ext[External]
  An --> lu[Labels<br>Units]
  An --> V["View Data Dictionary<br>(metadata)"]
  Im --> csv[CSV] & bin[Binary]
  Im --> mf[Multiple Files<br>Looping]
5 Analysis File Creation
5.1 Note on File Directory Structure
I have a directory for each major project and put everything in that directory (including data files), except for graphics files for figures, which are placed in their own subdirectory underneath the project folder. The directory name for the project includes key identifying information. Files within that directory do not contain project information in their names, nor do they contain dates, unless I want to freeze an old version of an analysis script.
With Quarto I specify that html files are to be self-contained, so there are no separate graphics files. For multi-analyst projects, or ones in which you want to capture the entire code history, having the project on github is worthwhile.
5.2 Importing and Creating Annotated Analysis Files
I typically create a compact analysis file in a separate R script called create.r and have it produce compressed R binary data.frame or data.table .rds files using saveRDS(name, 'name.rds', compress='xz'). Then I have an analysis script named, for example, a.qmd (for Quarto reports) or a.Rmd (for RMarkdown reports) that starts with d <- readRDS('name.rds'). This process is especially helpful when analysis file creation takes multiple steps or files are large. readRDS is very fast for loading binary R objects, and qread is even faster; to use it, replace saveRDS with the qs package equivalent: qsave(name, 'name.qs') and d <- qread('name.qs').

Templates for analysis reports are here, and a comprehensive report example may be found here.
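A minimal sketch of this two-script pattern follows; the file and variable names are hypothetical, and csv.get is covered later in this chapter.

# create.r : import raw data, annotate, save a compact binary file
require(Hmisc)
d <- csv.get('rawdata.csv', lowernames=TRUE)   # hypothetical raw file
d <- upData(d, labels=c(age='Age'), units=c(age='years'))
saveRDS(d, 'mydata.rds', compress='xz')
# The analysis script a.qmd then begins with:
d <- readRDS('mydata.rds')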
There is an older way to read/write binary R objects using the R load and save functions, and the Hmisc frontends for them, Load and Save. Files using these formats typically have suffixes of .rda, .RData, or .sav, and load stores the object under its original name, which might not be clear until you check your R global environment by looking in the upper right RStudio panel.

When variables need to be recoded, have labels added or changed, or have units of measurement added, I specify those using the Hmisc package upData function.
Labels and units appear in the describe and contents function outputs below, and in axis labels for graphics.

To facilitate some operations requiring variable names to be quoted, define a function .q to quote them automatically. .q is like the Hmisc function Cs but also allows elements to be named. It will be in Hmisc 4.7-1.
.q <- function(...) {
  s <- sys.call()[-1]
  w <- as.character(s)
  n <- names(s)
  if(length(n)) names(w) <- n
  w
}
.q(a, b, c, 'this and that')
[1] "a" "b" "c" "this and that"
.q(dog=a, giraffe=b, cat=c)
dog giraffe cat
"a" "b" "c"
Here is an upData example:
# Function to recode from atypical coding for yes/no in raw data
yn <- function(x) factor(x, 0:1, c('yes', 'no'))

d <- upData(d,
rename = .q(gender=sex, any.event=anyEvent),
posSE = yn(posSE),
newMI = yn(newMI),
newPTCA = yn(newPTCA),
newCABG = yn(newCABG),
death = yn(death),
hxofHT = yn(hxofHT),
hxofDM = yn(hxofDM),
hxofCig = factor(hxofCig, c(0, 0.5, 1),
c('heavy', 'moderate', 'non-smoker')),
hxofMI = yn(hxofMI),
hxofPTCA = yn(hxofPTCA),
hxofCABG = yn(hxofCABG),
chestpain= yn(chestpain),
anyEvent = yn(anyEvent),
drop=.q(event.no, phat, mics, deltaEF,
        newpkmphr, gdpkmphr, gdmaxmphr, gddpeakdp, gdmaxdp,
        hardness),
labels=c(
bhr = 'Basal heart rate',
basebp = 'Basal blood pressure',
basedp = 'Basal Double Product bhr*basebp',
age = 'Age',
pkhr = 'Peak heart rate',
sbp = 'Systolic blood pressure',
dp = 'Double product pkhr*sbp',
dose = 'Dose of dobutamine given',
maxhr = 'Maximum heart rate',
pctMphr = 'Percent maximum predicted heart rate achieved',
mbp = 'Maximum blood pressure',
dpmaxdo = 'Double product on max dobutamine dose',
dobdose = 'Dobutamine dose at max double product',
baseEF = 'Baseline cardiac ejection fraction',
dobEF = 'Ejection fraction on dobutamine',
chestpain = 'Chest pain',
ecg = 'Baseline electrocardiogram diagnosis',
restwma = 'Resting wall motion abnormality on echocardiogram',
posSE = 'Positive stress echocardiogram',
newMI = 'New myocardial infarction',
newPTCA = 'Recent angioplasty',
newCABG = 'Recent bypass surgery',
hxofHT = 'History of hypertension',
hxofDM = 'History of diabetes',
hxofMI = 'History of myocardial infarction',
hxofCig = 'History of smoking',
hxofPTCA = 'History of angioplasty',
hxofCABG = 'History of coronary artery bypass surgery',
anyEvent = 'Death, newMI, newPTCA, or newCABG'),
units=.q(age=years, bhr=bpm, basebp=mmHg, basedp='bpm*mmHg',
pkhr=mmHg, sbp=mmHg, dp='bpm*mmHg', maxhr=bpm,
mbp=mmHg, dpmaxdo='bpm*mmHg', baseEF='%', dobEF='%',
pctMphr='%', dose=mg, dobdose=mg)
)
saveRDS(d, 'stressEcho.rds', compress='xz')
# Or qsave(d, 'stressEcho.qs')
# Note that we could have automatically recoded all 0:1 variables
# if they were all to be treated identically:
for(x in names(d))
if(all(d[[x]] %in% c(0,1))) d[[x]] <- yn(d[[x]])
Sometimes metadata comes from a separate source. Suppose you imported a data frame d but have also imported a data frame m containing metadata: the same variable names used in d (in a field name) plus fields label, units, and comment. Dataset m can contain variables not currently being used. To add the labels and units into d and to store comments separately, use the following example.
n <- names(d)
i <- n %nin% m$name
if(any(i)) cat('The following variables have no metadata:',
               paste(n[i], collapse=', '), '\n')
vcomment <- m$comment
names(vcomment) <- m$name
mn <- subset(m, name %in% n)
labs <- mn$label
un <- mn$units
names(labs) <- names(un) <- mn$name
d <- upData(d, labels=labs, units=un)
To look up the comment for a variable at any time use e.g. vcomment['height']. All comments were saved in the vector vcomment in case the metadata dictionary m defined variables that are not in d but were to be imported later in the script. Comments for multiple variables can be looked up using e.g. vcomment[.q(height, weight, age)].
If you want to look up a variable’s comment without having to quote its name use the following:
vcom <- function(...)
  vcomment[as.character(sys.call()[-1])]
# Example usage: vcom(age, sbp, dbp)
The built-in function in R for reading .csv files is read.csv. The Hmisc package has a function csv.get which calls read.csv but offers some enhancements in variable naming, date handling, and reading variable labels from a specified row number. Illegal characters in variable names are changed to periods, and by default underscores are also changed to periods. If any variable names are changed and the labels argument is not given, original variable names are stored in the variable label attributes.
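Here is a hedged sketch of such a call; the file name and the row number holding the labels are assumptions for illustration.

require(Hmisc)
# Row 1 of mydata.csv holds variable names, row 2 holds variable labels
d <- csv.get('mydata.csv', lowernames=TRUE, labels=2)
contents(d)   # imported labels appear in the data dictionary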
The fread function in the data.table package is blazing fast for reading large files and offers a number of options. csv.get uses it if data.table is installed on the system.
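For instance, a direct fread call might look like this (the file and column names are hypothetical):

require(data.table)
# Read only three columns of a large file; returns a data.table
d <- fread('big.csv', select=c('id', 'age', 'sbp'))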
If reading data exported from REDCap that are placed into the project directory, I run the following to remove the duplicate variables REDCap produces (factor and non-factor versions of the same variable) and automatically convert dates to Date variables:
require(Hmisc)
require(qs)
getRs('importREDCap.r') # source() code to define function
mydata <- importREDCap() # by default operates on last downloaded export
qsave(mydata, 'mydata.qs')
When file names are not given to importREDCap, the function looks for the latest created .csv file and .R file with the same prefix and uses those. See this for more information.
getRs('importREDCap.r') also defines the cleanupREDCap function, which serves two useful roles. It removes html code (tags) that sometimes clutters variable labels, and it converts sequences of variables representing REDCap multiple choice variables into a single multiple choice variable, using the Hmisc package mChoice function to do the work. During the export process, REDCap converts what appears to be one multiple choice entity during data entry into one variable per possible choice, and these variables have names of the form basename___choice number (note the three underscores). Assuming that the checkboxLabels argument to drinkREDCap was set to TRUE if exporting using the REDCap API (see below), cleanupREDCap puts these back into a single variable. Such an mChoice variable has an internal value of, for example, 2;3;7 if choices 2, 3, and 7 were selected for a subject. These are converted to text form if you print the variable, run it through as.character() or format(), or run summary() or summaryM() on the variable. For example you may see output in the three choices example of "pain;stomach ache;headache". See the mChoice documentation.
To use cleanupREDCap just run something like mydata <- cleanupREDCap(mydata). By default the function will print informative messages about the creation of mChoice variables. When the new consolidated variables are created, the original choice variables created by REDCap are removed from the data table. The new variable is the base name of these deleted variables.
The importREDCap function deals with manually exported files. It would be better to create a reproducible automated workflow in which R makes a request to REDCap and a “live” export of forms in the REDCap database is run. This is facilitated by the R packages redcapAPI and rccola. rccola is a front-end to redcapAPI that makes user authentication safe and easy. The process starts by logging into REDCap and clicking on API on the left panel to fetch your API key. Then when running the rccola drinkREDCap function to retrieve one or more files for the first time, rccola will request your API key and ask you for a password. In later runs you will only be asked for that password to get access to REDCap files. This process ensures that sensitive API keys will not appear in your script, so the script can be freely shared or posted publicly.
There is one complication when using the REDCap “longitudinal” model for the database. If you have more than one form, with some forms baseline-only and some longitudinal, importing all the files at once will result in the baseline-only forms being expanded with many duplicate records per subject. So baseline and longitudinal forms need to be imported separately. In the following example, all baseline-only files are imported at once and stored in an R list object called B. Then longitudinal files are imported and stored in list L.
The example code below handles this complication. Before doing that, it retrieves all variable-level REDCap metadata for all forms and stores it in the data frame meta.
require(rccola)
require(Hmisc)
require(data.table)
getRs('importREDCap.r') # fetch cleanupREDCap function from Github
api_url <- 'https://redcap..../api/' # define your server URL
# Retrieve metadata for all REDCap forms, all variables
# Uses rccola package metaREDCap function
# Use key_get from the keyring package which is required by rccola
apikey <- keyring::key_get(service='rccola', username='eds', keyring='some project name')
meta <- metaREDCap(apikey, url=api_url)
head(meta)
# Data frame meta contains the following variables:
# field_name form_name section_header field_type field_label select_choices_or_calculations
# field_note text_validation_type_or_show_slider_number text_validation_min text_validation_max
# identifier branching_logic required_field custom_alignment question_number matrix_group_name
# matrix_ranking field_annotation
# Save date/time stamp for beginning of REDCap export
when <- Sys.time()
B <- new.env() # create an environment for drinkREDCap so we
# don't need to store retrieved files in our
# top-level environment
drinkREDCap('eds', keyring='your ring name', url=api_url,
envir=B, forms=.q(contact, demographics),
events='screening_arm_1', # restrict to baseline records
fields='record_id', # add this field to all files
checkboxLabels=TRUE) # allows mChoice to work
B <- as.list(B) # turn environment into ordinary list of data frames
# Remove .eds from starts of dataset names
names(B) <- sub('^eds\\.', '', names(B))
sapply(B, dim) # show rows and columns retrieved per form
# Make all the baseline data frames data tables with record_id
# as the key. Remove 4 variables from them
for(n in names(B)) {
setDT(B[[n]], key='record_id')
B[[n]][, .q(redcap_event_name, redcap_repeat_instrument,
redcap_repeat_instance, redcap_survey_identifier) := NULL]
}
# Import longitudinal files
L <- new.env()
drinkREDCap('eds', keyring=..., url=...,
envir=L, forms=.q(followup, lab),
fields='record_id',
checkboxLabels=TRUE)
L <- as.list(L)
names(L) <- sub('^eds\\.', '', names(L))
sapply(L, dim)
# For all longitudinal forms restructure the event variable and convert
# to data.table with key of record_id and day
# Delete original variable along with redcap_survey_identifier
# Character variable day is made so that alphabetic sorting puts it
# in time order. This requires putting a leading zero in one-digit days
etrans <- function(event) {
event <- sub('_arm_1', '', event)
event[event == 'screening'] <- 'day_-1'
sub('day_([1-9])$', 'day_0\\1', event)
}
for(n in names(L)) {
L[[n]]$day <- etrans(L[[n]]$redcap_event_name)
setDT(L[[n]], key=.q(record_id, day))
L[[n]][, .q(redcap_event_name, redcap_survey_identifier) := NULL]
}
L[[1]][, table(day)]
# Put baseline and longitudinal data tables in one large R object
R <- c(B, L)
# Remove html tags from variable labels and convert all sequences
# of multiple choices into single multiple choice variables using
# Hmisc::mChoice
R <- lapply(R, cleanupREDCap)
# Save entire database that is in one efficient R object to a
# binary file for fast retrieval
# Add export_time attribute to R to define export date/time
attr(R, 'export_time') <- when
qsave(R, 'R.qs') # 5.8MB 0.2s
# saveRDS(..., compress='xz') resulted in 3MB R.rds file, 10.7s
# qsave(R, 'R.qs', preset='archive') resulted in 3.8M, 2.8s
Subsequent programs may retrieve the database using R <- qread('R.qs') and access the export date/time stamp using attr(R, 'export_time').
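For example, at the top of a downstream analysis script:

require(qs)
R <- qread('R.qs')       # restore the full list of data tables
attr(R, 'export_time')   # when the REDCap export was run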
See the Multiple Files tab for useful R examples for processing multiple files inside objects like the R list just created.
Here are some examples showing how to use mChoice variables.
r <- R$demographics$race_eth
levels(r) # show all choice descriptions
summary(r)
# Get frequency table of all combinations, numeric form
table(r)
# Same but with labels
table(as.character(r))
# Count number of persons with 'American Indian or Alaska Native'
# as a choice (this was choice # 1)
sum(inmChoice(r, 1))
sum(inmChoice(r, 'American Indian or Alaska Native'))
# Count number of persons in either of two categories
sum(inmChoice(r, c('American Indian or Alaska Native', 'White')))
sum(inmChoice(r, 'White'))
# Count number in both categories
sum(inmChoice(r, c('American Indian or Alaska Native', 'White'),
condition='all'))
# This is the same as:
sum(inmChoice(r, 'American Indian or Alaska Native') &
inmChoice(r, 'White'))
# Create a new derived variable for the latter
R$demographics[, aiw := inmChoice(race_eth, c('American Indian or Alaska Native', 'White'),
condition='all')]
SAS, Stata, and SPSS binary files are converted to R data.frames using the R haven package. Here’s an example:
require(haven) # after you've installed the package
d <- read_sas('mydata.sas7bdat')
d <- read_xpt('mydata.xpt')   # SAS transport files
d <- read_dta('mydata.dta')   # Stata
d <- read_sav('mydata.sav')   # SPSS
d <- read_por('mydata.por')   # Older SPSS files
These import functions carry variable labels into the data frame and convert dates and times appropriately. Character vectors are not converted to factors.
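To check what came across, one can inspect the label attribute haven attaches to each column; a small sketch follows (the file name is hypothetical).

require(haven)
d <- read_dta('mydata.dta')               # hypothetical Stata file
sapply(d, function(x) attr(x, 'label'))   # show imported variable labels
d <- as_factor(d)   # optionally convert labelled values to factors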
The readxl package reads Excel spreadsheet .xls and .xlsx files to produce something that is almost a data frame. Here is an example where the imported dataset is converted into a plain data.frame and the variable names are changed to all lower case. read_excel does not handle variable labels. You can specify which sheet or which cell range to import if desired.
d <- readxl::read_excel('my.xlsx')
# Or use require(readxl) then d <- read_excel(...)
d <- as.data.frame(d)
# Or better: make it a data table: setDT(d)
names(d) <- tolower(names(d))
One of the most important principles to follow in programming data analyses is to not do the same thing more than once. Repetitive code wastes time and is harder to maintain. One example of avoiding repetition is in reading a large number of files in R. If the files are stored in one directory and have a consistent file naming scheme (or you want to import every .csv file in the directory), one can avoid naming the individual files. The results may be stored in an R list that has named elements, and there are many processing tasks that can be automated by looping over this list.
In the following example assume that all the data files are .csv files in the current working directory, and they all have names of the form xz*.csv. Let’s read all of them and put each file into a data frame named by the characters in front of .csv. These data frames are stuffed into a list named X. The Hmisc csv.get function is used to read the files, automatically translating dates to Date variables, and because lowernames=TRUE is specified, variable names are translated to lower case. There is an option to fetch variable labels from a certain row of each .csv file but we are not using that.
files <- list.files(pattern='xz.*.csv')   # vector of qualifying file names
# Get base part of *.csv
dnames <- sub('.csv', '', files)
X <- list()
i <- 0
for(f in files) {
  cat('Reading file', f, '\n')
  i <- i + 1
  d <- csv.get(f, lowernames=TRUE)
  # To read SAS, Stata, SPSS binary files use a haven function instead
  # To convert to data.table do setDT(d) here
  X[[dnames[i]]] <- d
}
saveRDS(X, 'X.rds', compress='xz')   # Efficiently store all datasets together
# Or qsave(X, 'X.qs')
To process one of the datasets one can do things like summary(X[[3]]) or summary(X$baseline), where the third dataset stored was named baseline because it was imported from baseline.csv.
Now there are many possibilities for processing all the datasets at once such as the following.
names(X)                   # print names of all datasets
k <- length(X)             # number of datasets
nam <- lapply(X, names)    # list with k elements, each is a vector of names
# Get the union of all variable names used in any dataset
sort(unique(unlist(nam)))  # all the variables appearing in any dataset
# Get variable names contained in all datasets
common <- names(X[[1]])
for(i in 2 : k) {
  common <- intersect(common, names(X[[i]]))
  if(! length(common)) break   # intersection already empty
}
sort(common)
# Compute number of variables across datasets
nvar <- sapply(X, length)   # or ncol
# Print number of observations per dataset
sapply(X, nrow)
# For each variable name count the number of datasets containing it
w <- data.table(dsname=rep(names(X), nvar), vname=unlist(nam))
w[, .N, keyby=vname]
# For each variable create a comma-separated list of datasets
# containing it
w[, .(datasets=paste(sort(dsname), collapse=', ')), keyby=vname]
# For datasets having a subject ID variable named id compute
# the number of unique ids
uid <- function(d) if('id' %in% names(d)) uniqueN(d$id) else NA
sapply(X, uid)
# To repeatedly analyze one of the datasets, extract it to a single data frame
d <- X$baseline
describe(d)
Most of these steps are automated in the multDataOverview function in the reptools repository.
getRs('reptools.r')   # if not done already
m <- multDataOverview(X, ~ id)   # Leave off m <- if not needing to print m
# To print a long table showing all the datasets containing each variable,
# print m
m
Sometimes one keeps multiple versions of the raw data to be imported inside the project folder. If you want to import the file that was last modified you can use the reptools latestFile function. You give it a pattern of file names to match. For example a pattern of .csv$ means “ending with .csv” and a pattern of ^baseline means “starting with baseline”. A pattern of foo means that foo is contained anywhere in the file name. Here is an example.
# Look for file names starting with base, followed by any number of
# characters, and ending with .csv
f <- latestFile('^base.*csv$')
d <- csv.get(f)
The Hmisc package cleanup.import function improves imported data storage in a number of ways, including converting double precision variables to integer when originally double but not containing fractional values (this halves the storage requirement). Hmisc::upData is the go-to function for annotating data frames/tables, renaming variables, and dropping variables. Hmisc::dataframeReduced removes problematic variables, e.g., those with a high fraction of missing values or that are binary with very small prevalence.
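A sketch of using these together follows; the threshold values are arbitrary, and the argument names shown should be verified against the Hmisc help files.

require(Hmisc)
d <- cleanup.import(d)   # e.g., stores non-fractional doubles as integer
# Drop variables missing in over half of observations or binary with
# prevalence below 0.02 (fracmiss and minprev arguments assumed)
d <- dataframeReduced(d, fracmiss=0.5, minprev=0.02)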
5.3 Variable Naming
I prefer short but descriptive variable names. As exemplified above, I use variable labels and units to provide more information. For example I wouldn’t name the age variable age.at.enrollment.yrs but would name it age with a label of Age at Enrollment and with units of years. Short, clear names unclutter code, especially formulas in statistical models. One can always fetch a variable label while writing a program (e.g., typing label(d$age) at the console) to check that you have the right variable (or put the data dictionary in a window for easy reference, as shown below). Hmisc package graphics and table-making functions such as summaryM and summary.formula specially typeset units in a smaller font.
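For example, a label and units can be attached or queried one variable at a time with the Hmisc label and units functions:

require(Hmisc)
label(d$age) <- 'Age at Enrollment'
units(d$age) <- 'years'
label(d$age)   # verify you have the right variable while coding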
5.4 Data Dictionary
The Hmisc package contents function will provide a concise data dictionary. Here is an example using the permanent version of the dataset created above (which coded binary variables as 0/1 instead of N/Y); it can be accessed with the Hmisc getHdata function. The top of the contents output has the number of levels for factor variables hyperlinked. Click on the number to go directly to the list of levels for that variable.
require(Hmisc)
getHdata(stressEcho)
d <- stressEcho
html(contents(d), levelType='table')
Data frame: d — 558 observations and 31 variables, maximum # NAs: 0

Name | Labels | Units | Levels | Storage
---|---|---|---|---
bhr | Basal heart rate | bpm | | integer
basebp | Basal blood pressure | mmHg | | integer
basedp | Basal Double Product bhr*basebp | bpm*mmHg | | integer
pkhr | Peak heart rate | mmHg | | integer
sbp | Systolic blood pressure | mmHg | | integer
dp | Double product pkhr*sbp | bpm*mmHg | | integer
dose | Dose of dobutamine given | mg | | integer
maxhr | Maximum heart rate | bpm | | integer
pctMphr | Percent maximum predicted heart rate achieved | % | | integer
mbp | Maximum blood pressure | mmHg | | integer
dpmaxdo | Double product on max dobutamine dose | bpm*mmHg | | integer
dobdose | Dobutamine dose at max double product | mg | | integer
age | Age | years | | integer
gender | | | 2 | integer
baseEF | Baseline cardiac ejection fraction | % | | integer
dobEF | Ejection fraction on dobutamine | % | | integer
chestpain | Chest pain | | | integer
restwma | Resting wall motion abnormality on echocardiogram | | | integer
posSE | Positive stress echocardiogram | | | integer
newMI | New myocardial infarction | | | integer
newPTCA | Recent angioplasty | | | integer
newCABG | Recent bypass surgery | | | integer
death | | | | integer
hxofHT | History of hypertension | | | integer
hxofDM | History of diabetes | | | integer
hxofCig | History of smoking | | 3 | integer
hxofMI | History of myocardial infarction | | | integer
hxofPTCA | History of angioplasty | | | integer
hxofCABG | History of coronary artery bypass surgery | | | integer
any.event | Death, newMI, newPTCA, or newCABG | | | integer
ecg | Baseline electrocardiogram diagnosis | | 3 | integer
Variable | Levels
---|---
gender | male, female
hxofCig | heavy, moderate, non-smoker
ecg | normal, equivocal, MI
You can write the text output of contents into a text file in your current working directory, and click on that file in the RStudio Files window to create a new tab in the editor panel where you can view the data dictionary at any time. This is especially helpful if you need a reminder of variable definitions that are stored in the variable labels. Users with the xless system command installed can pop up a contents window at any time by typing xless(contents(d)) in the console; xless is in Hmisc. Here is an example where the formatted data dictionary is saved.

capture.output(contents(d), file='contents.txt')
Or put the html version of the data dictionary into a small browser window to which you can refer at any point in analysis coding.
cat(html(contents(d)), file='contents.html')
browseURL('contents.html', browser='vivaldi -new-window')
RStudio provides a nice way to do this, facilitated by the htmlView helper function in reptools. htmlView takes any number of objects for which an html method exists to render them. They are rendered in the RStudio Viewer pane. If you are running outside RStudio, your default browser will be launched instead. Sometimes the RStudio Viewer will drop its arrow button, making it impossible to navigate back and forth to different html outputs. Code for htmlView and htmlViewx may be viewed in reptools.r.

getRs('reptools.r')
# reptools.r defines htmlView, htmlViewx, kabl, maketabs, dataChk, ...
htmlView(contents(d))
In some cases it is best to have a browser window open to the full descriptive statistics for a data table/frame (see below; the describe function also shows labels, units, and levels). This is another good use of htmlView. To be able to have multiple windows open to see information about datasets, it is advantageous to open an external browser window. The htmlViewx function will by default open a Vivaldi browser window, with the first output put in a new window and all subsequent objects displayed as tabs within that same window. This behavior can be controlled with the tab argument, and you can use a different browser by issuing for example options(vbrowser='firefox'). As an example suppose that two datasets were read from the hbiostat.org/data repository, and the data dictionary and descriptive statistics for both datasets were to be converted to html and placed in an external browser window for the duration of the R session.
(On Windows the browser name may need to be firefox.exe.) The tab names will not be correct until Hmisc 4.7-1 appears.

getHdata(support)
getHdata(support2)
htmlViewx(contents(support ), describe(support ),
          contents(support2), describe(support2))
A screenshot of the result is here.
5.5 Efficient Storage and Retrieval With fst
The fst package of Mark Klik provides an incredibly efficient way to write and read binary files holding R data frames or data tables. For very large files, fst adds efficiency by providing fast direct access of selected rows and/or columns. The only downside is that fst drops user-added attributes of columns, e.g. units and label. So we need to use the Hmisc keepHattrib and restoreHattrib functions to save such attributes in a separate object and to apply them back to the data frame/table after running read_fst. Here is an example for a 10,000,000 row data table with three columns.
require(data.table)
require(fst)
d <- data.table(a=runif(1e7), b=rnorm(1e7),
                f=factor(sample(c('a','b','c'), 1e7, replace=TRUE)))
d <- upData(d, labels=c(a='A', b='B', f='F'), units=c(b='cm'))
Input object size: 200002128 bytes; 3 variables 10000000 observations
New object size: 200003688 bytes; 3 variables 10000000 observations
at <- keepHattrib(d)
# The following took 1.5s (0.16s with default compress=50 but .fst file 1.5x larger)
write_fst(d, '/tmp/d.fst', compress=100)    # maximum compression
saveRDS(at, '/tmp/at.rds', compress='xz')   # not a data frame; can't use fst
dr  <- read_fst('/tmp/d.fst', as.data.table=TRUE)   # 0.06s (0.03s with compress=50)
atr <- readRDS('/tmp/at.rds')
contents(dr)
Data frame:dr 10000000 observations and 3 variables Maximum # NAs:0
Levels Storage
a double
b double
f 3 integer
+--------+------+
|Variable|Levels|
+--------+------+
| f | a,b,c|
+--------+------+
dr <- restoreHattrib(dr, atr)
contents(dr)
Data frame:dr 10000000 observations and 3 variables Maximum # NAs:0
Labels Units Levels Class Storage
a A numeric double
b B cm numeric double
f F 3 integer
+--------+------+
|Variable|Levels|
+--------+------+
| f | a,b,c|
+--------+------+
A pair of functions to automate this process is found in the saveRfiles script on github. Here is how this streamlines what is above. Let’s also show how to retrieve only a subset of the data table (two columns and only the first 1000 rows).
getRs('saveRfiles.r')
save_fst(d)
Data saved in d.fst and attributes saved in d-attr.rds
# In the next line the value of d is not used but d is used
# for the base file name
dr <- retrieve_fst(d, as.data.table=TRUE,
                   columns=.q(a, b), from=1, to=1000)
contents(dr)
Data frame:dr 1000 observations and 2 variables Maximum # NAs:0
Labels Units Class Storage
a A numeric double
b B cm numeric double
invisible(file.remove('d.fst', 'd-attr.rds'))
5.6 Efficient Storage and Retrieval With qs
The qs package, like fst, provides fast reading and writing of R objects. Unlike fst, qs can store any type of object and does not lose attributes, but it does not allow for direct access of selected rows or columns. Let’s compare qs, fst, and saveRDS for speed and disk storage efficiency. First, make sure that qs keeps all attributes.
require(qs)
getHdata(support) # loads from hbiostat.org/data; has
# lots of variable attributes
qsave(support, '/tmp/support.qs')
w <- qread('/tmp/support.qs')
identical(support, w)
[1] TRUE
Now try some things on the 10,000,000 row data table created above.
saveRDS(d, '/tmp/d.rds') # 13.6s 128M
saveRDS(d, '/tmp/dc.rds', compress='xz') # 124s 116M
write_fst(d, '/tmp/d.fst') # 0.2s 161M
write_fst(d, '/tmp/dc.fst', compress=100) # 1.4s 122M
qsave(d, '/tmp/d.qs', nthreads=3) # 0.3s 110M
Read times are as follows.
readRDS('/tmp/d.rds') # 0.9 s
readRDS('/tmp/dc.rds') # 7.5 s
read_fst('/tmp/d.fst') # 0.04s
read_fst('/tmp/dc.fst') # 0.07s
qread('/tmp/d.qs') # 0.2 s
qread('/tmp/d.qs', nthreads=6) # 0.16s
When direct access is not needed, i.e., when whole data frames/tables are to be read, the qs package would be a good default, especially if one takes into account that qs preserves all object attributes. qs dependencies are also more actively supported.
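If qs becomes the project default, a pair of thin wrappers (hypothetical helper names) keeps scripts uniform should the storage format change later:

require(qs)
saveData <- function(obj, file) qsave(obj, file)
loadData <- function(file) qread(file)
# Usage: saveData(d, 'd.qs');  d <- loadData('d.qs')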