5  Analysis File Creation

flowchart LR
Im[Import] --> vn[Manageable<br>Variable<br>Names] --> An[Annotate<br>Add Metadata]
An --> Int["Internal<br>(imported labels)"] & Ext[External]
An --> lu[Labels<br>Units]
An --> V["View Data Dictionary<br>(metadata)"]
Im --> csv[CSV] & bin[Binary]
Im --> mf[Multiple Files<br>Looping]

5.1 Note on File Directory Structure

I have a directory for each major project, and put everything in that directory (including data files) except for graphics files for figures, which are placed in their own subdirectory underneath the project folder. The directory name for the project includes key identifying information, and files within that directory do not contain project information in their names, nor do they contain dates, unless I want to freeze an old version of an analysis script.

With Quarto I specify that html files are to be self-contained, so there are no separate graphics files.

For multi-analyst projects, or ones in which you want to capture the entire code history, keeping the project on GitHub is worthwhile.

5.2 Importing and Creating Annotated Analysis Files

I typically create a compact analysis file in a separate R script called create.r and have it produce compressed R binary data.frame or data.table .rds files using saveRDS(name, 'name.rds', compress='xz'). Then I have an analysis script named, for example, a.qmd (for Quarto reports) or a.Rmd (for RMarkdown reports) that starts with d <- readRDS('name.rds'). This process is especially helpful when analysis file creation takes multiple steps or when files are large. readRDS is very fast for loading binary R objects, and qread is even faster.
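
A minimal sketch of this two-script pattern (the raw data file name is hypothetical):

## create.r -- rerun only when the raw data or derivations change
require(Hmisc)
d <- csv.get('rawdata.csv')            # import raw data (hypothetical file)
# ... recoding, upData annotations, merging ...
saveRDS(d, 'name.rds', compress='xz')

## a.qmd, first R code chunk -- every analysis run starts here
d <- readRDS('name.rds')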

See Section 5.6 for an alternative that is generally preferable to saveRDS using the qs package: qsave(name, 'name.qs') and d <- qread('name.qs').

Templates for analysis reports are here, and a comprehensive report example may be found here.
There is an older way to read/write binary R objects using the R load and save functions and the Hmisc frontends for them, Load and Save. Files using these formats typically have suffixes of .rda, .RData, or .sav, and using load stores the object under its original name, which might not be clear until you check your R global environment by looking in the upper right RStudio panel.
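
A small sketch of the load behavior just described, using a hypothetical object d:

d <- data.frame(x=1:3)
save(d, file='old.rda')   # object is stored under the name d
rm(d)
load('old.rda')           # silently recreates d in the global environment
ls()                      # reveals what load created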

When variables need to be recoded, have labels added or changed, or have units of measurement added, I specify those using the Hmisc package upData function.

Variable labels and units of measurement are used in special ways in my R packages. This will show up in the describe and contents function outputs below and in axis labels for graphics.

To facilitate some operations requiring variable names to be quoted, define a function .q to quote them automatically. .q is like the Hmisc function Cs but also allows elements to be named. It will be in Hmisc 4.7-1.

.q <- function(...) {
  s <- sys.call()[-1]           # capture the unevaluated arguments
  w <- as.character(s)          # turn each argument into a character string
  n <- names(s)
  if(length(n)) names(w) <- n   # carry along any element names
  w
}

.q(a, b, c, 'this and that')
[1] "a"             "b"             "c"             "this and that"
.q(dog=a, giraffe=b, cat=c)
    dog giraffe     cat 
    "a"     "b"     "c" 

Here is an upData example:

# Function to recode from atypical coding for yes/no in raw data
yn <- function(x) factor(x, 0:1, c('yes', 'no'))
d <-
  upData(d,
         rename = .q(gender=sex, any.event=anyEvent),
         posSE    = yn(posSE),
         newMI    = yn(newMI),
         newPTCA  = yn(newPTCA),
         newCABG  = yn(newCABG),
         death    = yn(death),
         hxofHT   = yn(hxofHT),
         hxofDM   = yn(hxofDM),
         hxofCig  = factor(hxofCig, c(0, 0.5, 1),
                           c('heavy', 'moderate', 'non-smoker')), 
         hxofMI   = yn(hxofMI),
         hxofPTCA = yn(hxofPTCA),
         hxofCABG = yn(hxofCABG),
         chestpain= yn(chestpain),
         anyEvent = yn(anyEvent),
         drop=.q(event.no, phat, mics, deltaEF,
                 newpkmphr, gdpkmphr, gdmaxmphr, gddpeakdp, gdmaxdp,
                 hardness),
         labels=c(
           bhr       = 'Basal heart rate',
           basebp    = 'Basal blood pressure',
           basedp    = 'Basal Double Product bhr*basebp',
           age       = 'Age',
           pkhr      = 'Peak heart rate',
           sbp       = 'Systolic blood pressure',
           dp        = 'Double product pkhr*sbp',
           dose      = 'Dose of dobutamine given',
           maxhr     = 'Maximum heart rate',
           pctMphr   = 'Percent maximum predicted heart rate achieved',
           mbp       = 'Maximum blood pressure',
           dpmaxdo   = 'Double product on max dobutamine dose',
           dobdose   = 'Dobutamine dose at max double product',
           baseEF    = 'Baseline cardiac ejection fraction',
           dobEF     = 'Ejection fraction on dobutamine', 
           chestpain = 'Chest pain', 
           ecg       = 'Baseline electrocardiogram diagnosis',
           restwma   = 'Resting wall motion abnormality on echocardiogram', 
           posSE     = 'Positive stress echocardiogram',
           newMI     = 'New myocardial infarction',
           newPTCA   = 'Recent angioplasty',
           newCABG   = 'Recent bypass surgery', 
           hxofHT    = 'History of hypertension', 
           hxofDM    = 'History of diabetes',
           hxofMI    = 'History of myocardial infarction',
           hxofCig   = 'History of smoking',
           hxofPTCA  = 'History of angioplasty',
           hxofCABG  = 'History of coronary artery bypass surgery',
           anyEvent  = 'Death, newMI, newPTCA, or newCABG'),
         units=.q(age=years, bhr=bpm, basebp=mmHg, basedp='bpm*mmHg',
           pkhr=bpm, sbp=mmHg, dp='bpm*mmHg', maxhr=bpm,
           mbp=mmHg, dpmaxdo='bpm*mmHg', baseEF='%', dobEF='%',
           pctMphr='%', dose=mg, dobdose=mg)
         )

saveRDS(d, 'stressEcho.rds', compress='xz')
# Or qsave(d, 'stressEcho.qs')

# Note that we could have automatically recoded all 0:1 variables
# if they were all to be treated identically:

for(x in names(d)) 
  if(all(d[[x]] %in% c(0,1))) d[[x]] <- yn(d[[x]])

Sometimes metadata come from a separate source. Suppose you imported a data frame d but have also imported a data frame m containing metadata: a field name holding the names of variables in d, plus fields label, units, and comment. Dataset m can describe variables not currently being used. To add the labels and units to d and to store the comments separately, use the following example.

n <- names(d)
i <- n %nin% m$name
if(any(i)) cat('The following variables have no metadata:',
               paste(n[i], collapse=', '), '\n')
vcomment        <- m$comment
names(vcomment) <- m$name
mn              <- subset(m, name %in% n)
labs            <- mn$label
un              <- mn$units
names(labs)     <- names(un) <- mn$name
d <- upData(d, labels=labs, units=un)

To look up the comment for a variable at any time use e.g. vcomment['height']. All comments were saved in vector vcomment in case the metadata dictionary m defined variables that are not in d but were to be imported later in the script. Comments for multiple variables can be looked up using e.g. vcomment[.q(height, weight, age)].

If you want to look up a variable’s comment without having to quote its name use the following:

vcom <- function(...)
  vcomment[as.character(sys.call()[-1])]
# Example usage: vcom(age,sbp,dbp)

The built-in function in R for reading .csv files is read.csv. The Hmisc package has a function csv.get which calls read.csv but offers some enhancements in variable naming, date handling, and reading variable labels from a specified row number. Illegal characters in variable names are changed to periods, and by default underscores are also changed to periods. If any variable names are changed and the labels argument is not given, original variable names are stored in the variable label attributes.
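
A sketch of a csv.get call, assuming a hypothetical file whose variable labels are stored in row 2; check ?csv.get for the exact argument meanings before relying on this:

require(Hmisc)
d <- csv.get('mydata.csv', lowernames=TRUE, labels=2,
             dateformat='%m/%d/%Y')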

The fread function in the data.table package is blazing fast for reading large files and offers a number of options. csv.get uses it if data.table is installed on the system.
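
For illustration, a minimal fread call (file name and column selection are hypothetical); the select argument reads only the named columns, which saves memory on wide files:

require(data.table)
d <- fread('big.csv', select=c('id', 'age', 'sbp'))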

If reading data exported from REDCap and placed into the project directory, I run the following to get rid of duplicate variables (REDCap produces factor and non-factor versions of the same variable) and to automatically convert dates to Date variables:

require(Hmisc)
require(qs)
getRs('importREDCap.r')   # source() code to define function
mydata <- importREDCap()  # by default operates on last downloaded export
qsave(mydata, 'mydata.qs')

When file names are not given to importREDCap, the function looks for the latest created .csv file and the .R file with the same prefix and uses those. See this for more information.

getRs('importREDCap.r') also defines the cleanupREDCap function, which serves two useful roles. It removes HTML tags that sometimes clutter variable labels, and it converts the sequences of variables representing REDCap multiple-choice fields into single multiple-choice variables, using the Hmisc package mChoice function to do the work. During the export process, REDCap converts what appears to be one multiple-choice entity during data entry into one variable per possible choice, and these variables have names of the form basename___choice number (note the three underscores). Assuming that the checkboxLabels argument to drinkREDCap was set to TRUE if exporting using the REDCap API (see below), cleanupREDCap puts these back into a single variable. Such an mChoice variable has an internal value of, for example, 2;3;7 if choices 2, 3, and 7 were selected for a subject. These are converted to text form if you print the variable, run it through as.character() or format(), or run summary() or summaryM() on it. For example, with three choices selected you may see "pain;stomach ache;headache". See the mChoice documentation.
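
A small sketch of mChoice itself, with hypothetical symptom variables for three subjects, showing the internal and text forms:

require(Hmisc)
sym1 <- c('pain', 'headache', 'pain')
sym2 <- c('stomach ache', NA, 'headache')
m <- mChoice(sym1, sym2, label='Symptoms')
m                # prints in text form, e.g. pain;stomach ache
unclass(m)       # internal representation, e.g. 1;2
as.character(m)
summary(m)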

To use cleanupREDCap just run something like mydata <- cleanupREDCap(mydata). By default the function prints informative messages about the creation of mChoice variables. When a new consolidated variable is created, the original choice variables created by REDCap are removed from the data table, and the new variable takes the base name of the deleted variables.

The importREDCap function deals with manually exported files. It would be better to create a reproducible automated workflow in which R makes a request to REDCap and a “live” export of forms in the REDCap database is run. This is facilitated by the R packages redcapAPI and rccola. rccola is a front-end to redcapAPI that makes user authentication safe and easy. The process starts by logging into REDCap and clicking on API on the left panel to fetch your API key. Then, when running the rccola drinkREDCap function to retrieve one or more files for the first time, rccola will request your API key and ask you for a password. In later runs you will only be asked for that password to get access to REDCap files. This process ensures that sensitive API keys do not appear in your script, so the script can be freely shared or posted publicly.

There is one complication when using the REDCap “longitudinal” model for the database. If you have more than one form, some forms are baseline-only forms, and some are longitudinal, importing all the files at once will result in the baseline-only forms being expanded with many duplicate records per subject. So baseline and longitudinal forms need to be imported separately. In the following example, all baseline-only files are imported at once, and these files are stored in an R list object called B. Then longitudinal files are imported, and stored in list L.

The example code below handles this complication. Before doing that it retrieves all variable-level REDCap metadata for all forms and stores this in data frame meta.

require(rccola)
require(Hmisc)
require(data.table)
getRs('importREDCap.r')   # fetch cleanupREDCap function from GitHub
api_url <- 'https://redcap..../api/'  # define your server URL

# Retrieve metadata for all REDCap forms, all variables
# Uses rccola package metaREDCap function
# Use key_get from the keyring package which is required by rccola

apikey <- keyring::key_get(service='rccola', username='eds', keyring='some project name')
meta <- metaREDCap(apikey, url=api_url)
head(meta)
# Data frame meta contains the following variables: 
# field_name form_name section_header field_type field_label select_choices_or_calculations
# field_note text_validation_type_or_show_slider_number text_validation_min text_validation_max
# identifier branching_logic required_field custom_alignment question_number matrix_group_name
# matrix_ranking field_annotation

# Save date/time stamp for beginning of REDCap export
when <- Sys.time()

B <- new.env()    # create an environment for drinkREDCap so we
                  # don't need to store retrieved files in our
                  # top-level environment
drinkREDCap('eds', keyring='your ring name', url=api_url,
            envir=B, forms=.q(contact, demographics),
            events='screening_arm_1',  # restrict to baseline records
            fields='record_id',        # add this field to all files
            checkboxLabels=TRUE)       # allows mChoice to work
B <- as.list(B)   # turn environment into ordinary list of data frames
# Remove the eds. prefix from dataset names
names(B) <- sub('^eds\\.', '', names(B))
sapply(B, dim)    # show rows and columns retrieved per form

# Make all the baseline data frames data tables with record_id
# as the key.  Remove 4 variables from them

for(n in names(B)) {
  setDT(B[[n]], key='record_id')
  B[[n]][, .q(redcap_event_name, redcap_repeat_instrument,
              redcap_repeat_instance, redcap_survey_identifier) := NULL]
}

# Import longitudinal files

L <- new.env()
drinkREDCap('eds', keyring=..., url=...,
            envir=L, forms=.q(followup, lab),
            fields='record_id',
            checkboxLabels=TRUE)
L <- as.list(L)
names(L) <- sub('^eds\\.', '', names(L))
sapply(L, dim)

# For all longitudinal forms restructure the event variable and convert
# to data.table with key of record_id and day
# Delete original variable along with redcap_survey_identifier
# Character variable day is made so that alphabetic sorting puts it
# in time order.  This requires putting a leading zero in one-digit days

etrans <- function(event) {
  event     <- sub('_arm_1', '', event)
  event[event == 'screening'] <- 'day_-1'
  sub('day_([1-9])$', 'day_0\\1', event)
}
  
for(n in names(L)) {
  L[[n]]$day <- etrans(L[[n]]$redcap_event_name)
  setDT(L[[n]], key=.q(record_id, day))
  L[[n]][, .q(redcap_event_name, redcap_survey_identifier) := NULL]
}

L[[1]][, table(day)]

# Put baseline and longitudinal data tables in one large R object
R <- c(B, L)

# Remove html tags from variable labels and convert all sequences
# of multiple choices into single multiple choice variables using
# Hmisc::mChoice

R <- lapply(R, cleanupREDCap)

# Save entire database that is in one efficient R object to a
# binary file for fast retrieval
# Add export_time attribute to R to define export date/time
attr(R, 'export_time') <- when
qsave(R, 'R.qs')    # 5.8MB 0.2s
# saveRDS(..., compress='xz') resulted in 3MB R.rds file, 10.7s
# qsave(R, 'R.qs', preset='archive')  resulted in 3.8M, 2.8s

Subsequent programs may retrieve the database using R <- qread('R.qs') and access the export date/time stamp using attr(R, 'export_time').
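
In a downstream script this looks like:

R <- qread('R.qs')        # fast re-load of the whole database
attr(R, 'export_time')    # date/time at which the REDCap export was run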

See the Multiple Files tab for useful R examples for processing multiple files inside objects like the R list just created.

Here are some examples showing how to use mChoice variables.

r <- R$demographics$race_eth
levels(r)   # show all choice descriptions
summary(r)
# Get frequency table of all combinations, numeric form
table(r)
# Same but with labels
table(as.character(r))
# Count number of persons with 'American Indian or Alaska Native'
# as a choice (this was choice # 1)
sum(inmChoice(r, 1))
sum(inmChoice(r, 'American Indian or Alaska Native'))
# Count number of persons in either of two categories
sum(inmChoice(r, c('American Indian or Alaska Native', 'White')))
sum(inmChoice(r, 'White'))
# Count number in both categories
sum(inmChoice(r, c('American Indian or Alaska Native', 'White'),
              condition='all'))
# This is the same as:
sum(inmChoice(r, 'American Indian or Alaska Native') &
    inmChoice(r, 'White'))
# Create a new derived variable for the latter
R$demographics[, aiw := inmChoice(race_eth, c('American Indian or Alaska Native', 'White'),
                                  condition='all')]

SAS, Stata, and SPSS binary files are converted to R data.frames using the R haven package. Here’s an example:

require(haven)   # after you've installed the package
d <- read_sas('mydata.sas7bdat')
d <- read_xpt('mydata.xpt')        # SAS transport files
d <- read_dta('mydata.dta')        # Stata
d <- read_sav('mydata.sav')        # SPSS
d <- read_por('mydata.por')        # Older SPSS files

These import functions carry variable labels into the data frame and convert dates and times appropriately. Character vectors are not converted to factors.
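
A quick way to verify this after import (file and variable names are hypothetical); Hmisc's label function reads the label attribute that haven attaches:

require(haven)
require(Hmisc)
d <- read_sas('mydata.sas7bdat')
label(d$age)       # variable label carried over from the SAS dataset
sapply(d, class)   # dates appear as Date; character vectors stay character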

The readxl package reads Excel spreadsheet .xls and .xlsx files to produce something that is almost a data frame. Here is an example where the imported dataset is converted into a plain data.frame and the variable names are changed to all lower case. read_excel does not handle variable labels. You can specify which sheet or which cell range to import if desired.

d <- readxl::read_excel('my.xlsx')
# Or use require(readxl) then d <- read_excel(...)
d <- as.data.frame(d)
# Or better: make it a data table: setDT(d)
names(d) <- tolower(names(d))
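
For the sheet and range options just mentioned, a sketch with a hypothetical sheet name and cell range:

d <- readxl::read_excel('my.xlsx', sheet='Baseline', range='A1:G100')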

One of the most important principles to follow in programming data analyses is to not do the same thing more than once. Repetitive code wastes time and is harder to maintain. One example of avoiding repetition is in reading a large number of files in R. If the files are stored in one directory and have a consistent file naming scheme (or you want to import every .csv file in the directory), one can avoid naming the individual files. The results may be stored in an R list that has named elements, and there are many processing tasks that can be automated by looping over this list.

In the following example assume that all the data files are .csv files in the current working directory, and they all have names of the form xz*.csv. Let’s read all of them and put each file into a data frame named by the characters in front of .csv. These data frames are stuffed into a list named X. The Hmisc csv.get function is used to read the files, automatically translating dates to Date variables, and because lowernames=TRUE is specified, variable names are translated to lower case. There is an option to fetch variable labels from a certain row of each .csv file but we are not using that.

files <- list.files(pattern='^xz.*\\.csv$')  # vector of qualifying file names
# Get base part of file names by removing the .csv suffix
dnames <- sub('\\.csv$', '', files)
X <- list()
i <- 0
for(f in files) {
  cat('Reading file', f, '\n')
  i <- i + 1
  d <- csv.get(f, lowernames=TRUE)
  # To read SAS, Stata, SPSS binary files use a haven function instead
  # To convert to data.table do setDT(d) here
  X[[dnames[i]]] <- d
}
saveRDS(X, 'X.rds', compress='xz')  # Efficiently store all datasets together
# Or qsave(X, 'X.qs')

To process one of the datasets one can do things like summary(X[[3]]) or summary(X$baseline) where the third dataset stored was named baseline because it was imported from baseline.csv.

Now there are many possibilities for processing all the datasets at once such as the following.

names(X)                  # print names of all datasets
k   <- length(X)          # number of datasets
nam <- lapply(X, names)   # list with k elements, each is a vector of names
# Get the union of all variable names used in any dataset
sort(unique(unlist(nam))) # all the variables appearing in any dataset
# Get variable names contained in all datasets
common <- names(X[[1]])
for(i in 2 : k) {
  common <- intersect(common, names(X[[i]]))
  if(! length(common)) break  # intersection already empty
}
sort(common)
# Compute number of variables across datasets
nvar <- sapply(X, length)  # or ncol
# Print number of observations per dataset
sapply(X, nrow)
# For each variable name count the number of datasets containing it
w <- data.table(dsname=rep(names(X), nvar), vname=unlist(nam))
w[, .N, keyby=vname]
# For each variable create a comma-separated list of datasets
# containing it
w[, .(datasets=paste(sort(dsname), collapse=', ')), keyby=vname]
# For datasets having a subject ID variable named id compute
# the number of unique ids
uid <- function(d) if('id' %in% names(d)) uniqueN(d$id) else NA
sapply(X, uid)
# To repeatedly analyze one of the datasets, extract it to a single data frame
d <- X$baseline
describe(d)

Most of these steps are automated in the reptools repository multDataOverview function.

getRs('reptools.r')   # if not done already
m <- multDataOverview(X, ~ id)  # Leave off m <- if not needing to print m
# To print a long table showing all the datasets containing each variable,
# print m
m

Sometimes one keeps multiple versions of the raw data to be imported inside the project folder. If you want to import the file that was last modified you can use the reptools latestFile function. You give it a pattern of file names to match. For example a pattern of .csv$ means “ending with .csv” and a pattern of ^baseline means “starting with baseline”. A pattern of foo means that foo is contained anywhere in the file name. Here is an example.

# Look for file names starting with base, followed by any number of
# characters, and ending with .csv
f <- latestFile('^base.*csv$')
d <- csv.get(f)

The Hmisc package cleanup.import function improves imported data storage in a number of ways, including converting double precision variables to integer when they were stored as double but contain no fractional values (this halves the storage requirement). Hmisc::upData is the go-to function for annotating data frames/tables, renaming variables, and dropping variables. Hmisc::dataframeReduce removes problematic variables, e.g., those with a high fraction of missing values or binary variables with very small prevalence.
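
A sketch of these cleanup steps; the dataframeReduce thresholds shown are hypothetical (see its help file for the argument definitions):

require(Hmisc)
d <- cleanup.import(d)    # e.g. store whole-number doubles as integer
# Drop variables missing in over half of observations, or binary with
# prevalence below 2.5% (illustrative cutoffs)
d <- dataframeReduce(d, fracmiss=0.5, minprev=0.025)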

5.3 Variable Naming

I prefer short but descriptive variable names. As exemplified above, I use variable labels and units to provide more information. For example I wouldn’t name the age variable age.at.enrollment.yrs but would name it age with a label of Age at Enrollment and with units of years. Short, clear names unclutter code, especially formulas in statistical models. One can always fetch a variable label while writing a program (e.g., typing label(d$age) at the console) to check that you have the right variable (or put the data dictionary in a window for easy reference, as shown below). Hmisc package graphics and table making functions such as summaryM and summary.formula specially typeset units in a smaller font.

5.4 Data Dictionary

The Hmisc package contents function will provide a concise data dictionary. Here is an example using the permanent version (which coded binary variables as 0/1 instead of N/Y) of the dataset created above, which can be accessed with the Hmisc getHdata function. The top of the contents output has the number of levels for factor variables hyperlinked. Click on the number to go directly to the list of levels for that variable.

require(Hmisc)
getHdata(stressEcho)
d <- stressEcho
html(contents(d), levelType='table')
d Contents

Data frame:d

558 observations and 31 variables, maximum # NAs:0

Name       Labels                                              Units     Levels  Storage
bhr        Basal heart rate                                    bpm               integer
basebp     Basal blood pressure                                mmHg              integer
basedp     Basal Double Product bhr*basebp                     bpm*mmHg          integer
pkhr       Peak heart rate                                     bpm               integer
sbp        Systolic blood pressure                             mmHg              integer
dp         Double product pkhr*sbp                             bpm*mmHg          integer
dose       Dose of dobutamine given                            mg                integer
maxhr      Maximum heart rate                                  bpm               integer
pctMphr    Percent maximum predicted heart rate achieved       %                 integer
mbp        Maximum blood pressure                              mmHg              integer
dpmaxdo    Double product on max dobutamine dose               bpm*mmHg          integer
dobdose    Dobutamine dose at max double product               mg                integer
age        Age                                                 years             integer
gender                                                                   2       integer
baseEF     Baseline cardiac ejection fraction                  %                 integer
dobEF      Ejection fraction on dobutamine                     %                 integer
chestpain  Chest pain                                                            integer
restwma    Resting wall motion abnormality on echocardiogram                     integer
posSE      Positive stress echocardiogram                                        integer
newMI      New myocardial infarction                                             integer
newPTCA    Recent angioplasty                                                    integer
newCABG    Recent bypass surgery                                                 integer
death                                                                            integer
hxofHT     History of hypertension                                               integer
hxofDM     History of diabetes                                                   integer
hxofCig    History of smoking                                           3        integer
hxofMI     History of myocardial infarction                                      integer
hxofPTCA   History of angioplasty                                                integer
hxofCABG   History of coronary artery bypass surgery                             integer
any.event  Death, newMI, newPTCA, or newCABG                                     integer
ecg        Baseline electrocardiogram diagnosis                         3        integer

Variable  Levels
gender    male, female
hxofCig   heavy, moderate, non-smoker
ecg       normal, equivocal, MI

You can write the text output of contents into a text file in your current working directory, and click on that file in the RStudio Files window to create a new tab in the editor panel where you can view the data dictionary at any time. This is especially helpful if you need a reminder of variable definitions that are stored in the variable labels. Here is an example where the formatted data dictionary is saved.

Users having the xless system command installed can pop up a contents window at any time by typing xless(contents(d)) in the console; xless is in Hmisc.

capture.output(contents(d), file='contents.txt')

Or put the html version of the data dictionary into a small browser window to which you can refer at any point in analysis coding.

cat(html(contents(d)), file='contents.html')
browseURL('contents.html', browser='vivaldi -new-window')

RStudio provides a nice way to do this, facilitated by the htmlView helper function in reptools. htmlView takes any number of objects for which an html method exists to render them. They are rendered in the RStudio Viewer pane. If you are running outside RStudio, your default browser will be launched instead.

Occasionally the RStudio Viewer will drop its arrow button, making it impossible to navigate back and forth among different html outputs.
Code for htmlView and htmlViewx may be viewed in reptools.r.
getRs('reptools.r')
# reptools.r defines htmlView, htmlViewx, kabl, maketabs, dataChk, ...
htmlView(contents(d))

In some cases it is best to have a browser window open to the full descriptive statistics for a data table/frame (see below; the describe function also shows labels, units, and levels).
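
Paralleling the contents example above, a sketch that writes describe output to an html file for browsing:

cat(html(describe(d)), file='describe.html')
browseURL('describe.html')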

For either approach it would be easy to have multiple tabs open, one tab for each of a series of data tables, or use htmlView.

To be able to have multiple windows open to see information about datasets it is advantageous to open an external browser window. The htmlViewx function will by default open a Vivaldi browser window with the first output put in a new window and all subsequent objects displayed as tabs within that same window. This behavior can be controlled with the tab argument, and you can use a different browser by issuing for example options(vbrowser='firefox'). As an example suppose that two datasets were read from the hbiostat.org/data data repository, and the data dictionary and descriptive statistics for both datasets were to be converted to html and placed in an external browser window for the duration of the R session.

On Windows you may have to specify a full path ending in firefox.exe. The tab names will not be correct until Hmisc 4.7-1 appears.
getHdata(support)
getHdata(support2)
htmlViewx(contents(support ), describe(support ),
          contents(support2), describe(support2))

A screenshot of the result is here.

5.5 Efficient Storage and Retrieval With fst

The fst package of Mark Klik provides an incredibly efficient way to write and read binary files holding R data frames or data tables. For very large files, fst adds efficiency by providing fast direct access of selected rows and/or columns. The only downside is that fst drops user-added attributes of columns, e.g. units and label. So we need to use the Hmisc keepHattrib and restoreHattrib functions to save such attributes in a separate object and to apply them back to the data frame/table after running read_fst. Here is an example for a 10,000,000 row data table with three columns.

See Section 10.6 for more examples of the fst package.
require(data.table)
require(fst)
d <- data.table(a=runif(1e7), b=rnorm(1e7),
                f=factor(sample(c('a','b','c'), 1e7, replace=TRUE)))
d <- upData(d, labels=c(a='A', b='B', f='F'), units=c(b='cm'))
Input object size:   200002128 bytes;  3 variables  10000000 observations
New object size:     200003688 bytes;  3 variables  10000000 observations
at <- keepHattrib(d)
# The following took 1.5s (0.16s with default compress=50 but .fst file 1.5x larger)
write_fst(d,  '/tmp/d.fst',  compress=100)     # maximum compression
saveRDS(at, '/tmp/at.rds',   compress='xz')    # not a data frame; can't use fst
dr  <- read_fst('/tmp/d.fst', as.data.table=TRUE)  # 0.06s (0.03s with compress=50)
atr <- readRDS('/tmp/at.rds')
contents(dr)

Data frame:dr   10000000 observations and 3 variables    Maximum # NAs:0


  Levels Storage
a         double
b         double
f      3 integer

+--------+------+
|Variable|Levels|
+--------+------+
|    f   | a,b,c|
+--------+------+
dr <- restoreHattrib(dr, atr)
contents(dr)

Data frame:dr   10000000 observations and 3 variables    Maximum # NAs:0


  Labels Units Levels   Class Storage
a      A              numeric  double
b      B    cm        numeric  double
f      F            3         integer

+--------+------+
|Variable|Levels|
+--------+------+
|    f   | a,b,c|
+--------+------+

A pair of functions to automate this process is found in the saveRfiles script on GitHub. Here is how it streamlines the steps above. Let's also show how to retrieve only a subset of the data table (two columns and only the first 1000 rows).

getRs('saveRfiles.r')
save_fst(d)
Data saved in d.fst and attributes saved in d-attr.rds 
# In the next line the value of d is not used but d is used
# for the base file name
dr <- retrieve_fst(d, as.data.table=TRUE,
                   columns=.q(a, b), from=1, to=1000)
contents(dr)

Data frame:dr   1000 observations and 2 variables    Maximum # NAs:0


  Labels Units   Class Storage
a      A       numeric  double
b      B    cm numeric  double
invisible(file.remove('d.fst', 'd-attr.rds'))

5.6 Efficient Storage and Retrieval With qs

The qs package, like fst, provides fast reading and writing of R objects. Unlike fst, qs can store any type of object and does not lose attributes, but it does not allow direct access to selected rows or columns. Let's compare qs, fst, and saveRDS for speed and disk storage efficiency. First, make sure that qs keeps all attributes.

require(qs)
getHdata(support)   # loads from hbiostat.org/data; has
                    # lots of variable attributes
qsave(support, '/tmp/support.qs')
w <- qread('/tmp/support.qs')
identical(support, w)
[1] TRUE

Now try some things on the 10,000,000 row data table created above.

saveRDS(d, '/tmp/d.rds')                  # 13.6s 128M
saveRDS(d, '/tmp/dc.rds', compress='xz')  #  124s 116M
write_fst(d, '/tmp/d.fst')                #  0.2s 161M
write_fst(d, '/tmp/dc.fst', compress=100) #  1.4s 122M
qsave(d, '/tmp/d.qs', nthreads=3)         #  0.3s 110M

Read times are as follows.

readRDS('/tmp/d.rds')           # 0.9 s
readRDS('/tmp/dc.rds')          # 7.5 s
read_fst('/tmp/d.fst')          # 0.04s
read_fst('/tmp/dc.fst')         # 0.07s
qread('/tmp/d.qs')              # 0.2 s
qread('/tmp/d.qs', nthreads=6)  # 0.16s

When direct access is not needed, i.e., when whole data frames/tables are to be read, the qs package is a good default, especially taking into account that qs preserves all object attributes. qs and its dependencies are also more actively supported.