10  Data Manipulation and Aggregation

flowchart LR
dt[data.table]
sub[Subsetting Observations<br>and Variables]
am[Adding/Modifying Variables]
rec[Recode] --> hrec[R Expressions<br>factor<br>fcase<br>score.binary<br>table look-up]
res[Reshape] --> lw[Long to Wide<br>dcast] & wl[Wide to Long<br>melt]
flu[Fast Lookup<br>From Disk] --> dtf[data.table & fst]

The data.table package provides a concise, consistent syntax for managing simple and complex data manipulation tasks, and it is extremely efficient for large datasets. One of the best organized tutorials is this, and a cheatsheet for data transformation is here. Userful data.table vignettes are here. A master cheatsheet for data.table is here from which the general syntax below is taken, where DT represents a data table.

   DT[ i,  j,  by ] # + extra arguments
        |   |   |
        |   |    -------> grouped by what?
        |    -------> what to do?
         ---> on which rows?

The following chart may help one to conceptualize the data.table framework. To enter into the environment of a data table means to be able to address its columns directly without having, for example, the DT prefix.

flowchart TD
DTO["DT ["]
E["Enter Environment<br>of DT<br>DT[rows,columns,by/on]"]
DTO --> E
R[Rows To Fetch<br>Or Change]
C["Columns to Fetch<br><hr>Or Create/Redefine<br>with :="]
OB["by (grouping)<br><hr>on (matching)"]
E --> R & C & OB
R & C & OB --> DTC["]"] --> L[Leave<br>Environment]
L --> N["No :=<br><br>Print/Plot Output<br>or Assign Result To<br>New Data Table"]
L --> IP[":=<br><br>Change DT in Place"]

Data tables are also data frames so work on any method handling data frames. But data tables do not contain rownames.

Several data.table examples follow. I like to hold the current dataset in d to save typing. Some basic operations on data tables are:

d[2]                  # print 2nd row
d[2:4]                # print rows 2,3,4
d[month == 'November']# fetch rows for November
d[month %like%  'mb'] # fetch rows with 'mb' in month
d[month %ilike% 'mb'] # fetch rows with 'mb' in month ignoring case
d[y > 2 & z > 3]      # rows satisfying conditions
d[, age]              # retrieve one variable
d[, .(age, gender)]   # make a new table with two variables
i <- 2; d[, ..i]      # get column 2
v <- 'age'; d[, ..v]  # get variable named by contents of v
d[, (v)]              # another way; these return data tables
d[.N]                 # last row
d[, .N, keyby=age]    # number of rows for each age, sorted
d[1:10, bhr:pkhr]     # first 10 rows, variables bhr - pkhr
d[1:10, !(bhr:pkhr)]  # all but those variables
d[, 2:4]              # get columns 2-4
d[, {...}]            # run any number of lines of code in ...
                      # that reference variables inside d

data.table does many of its operations by reference to avoid the overhead of having multiple copies of data tables. This idea carries over to apparent copies of data tables. Here is an example.

require(Hmisc)
require(data.table)
knit_print <- knitr::normal_print # make knitr standard print output (not kable)
a <- data.table(x1=1:3, x2=letters[1:3])
a
   x1 x2
1:  1  a
2:  2  b
3:  3  c
b <- a                  # no copy; b is just a pointer to a
b[, x2 := toupper(x2)]  # changes a
a
   x1 x2
1:  1  A
2:  2  B
3:  3  C
a <- data.table(x1=1:3, x2=letters[1:3])
a2 <- copy(a)           # fresh copy with its own memory space
a2[, x2 := toupper(x2)] # doesn't change a
a
   x1 x2
1:  1  a
2:  2  b
3:  3  c

10.1 Analyzing Selected Variables and Subsets

getHdata(stressEcho)
d <- stressEcho
setDT(d)
d[, html(describe(age))]
Descriptives age: Age years
image
nmissingdistinctInfoMeanGmd.05.10.25.50.75.90.95
5580620.99967.3413.4146.8551.0060.0069.0075.0082.0085.00
lowest : 26 28 29 30 33 , highest: 89 90 91 92 93
d[, html(describe(~ age + gender))]
age + gender Descriptives
age + gender

2 Variables   558 Observations

age: Age years
image
nmissingdistinctInfoMeanGmd.05.10.25.50.75.90.95
5580620.99967.3413.4146.8551.0060.0069.0075.0082.0085.00
lowest : 26 28 29 30 33 , highest: 89 90 91 92 93
gender
nmissingdistinct
55802
 Value        male female
 Frequency     220    338
 Proportion  0.394  0.606
 

d[gender == 'female', html(describe(age))]   # analyze age for females
Descriptives age: Age years
image
nmissingdistinctInfoMeanGmd.05.10.25.50.75.90.95
3380580.99967.0113.7447.0050.7059.2568.0076.0083.0085.00
lowest : 26 28 29 33 34 , highest: 87 88 89 90 91
html(describe(d[, .(age, gender)], 'Age and Gender Stats'))
Age and Gender Stats Descriptives
Age and Gender Stats

2 Variables   558 Observations

age: Age years
image
nmissingdistinctInfoMeanGmd.05.10.25.50.75.90.95
5580620.99967.3413.4146.8551.0060.0069.0075.0082.0085.00
lowest : 26 28 29 30 33 , highest: 89 90 91 92 93
gender
nmissingdistinct
55802
 Value        male female
 Frequency     220    338
 Proportion  0.394  0.606
 

# Separate analysis by female, male.  Use keyby instead of by to sort the usual way.
d[, print(describe(age, descript=gender)), by=gender]
male : Age [years] 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
     220        0       51    0.999    67.86    12.91    45.95    51.00 
     .25      .50      .75      .90      .95 
   62.00    69.00    75.00    81.00    86.00 

lowest : 30 34 38 40 43, highest: 88 89 91 92 93
female : Age [years] 
       n  missing distinct     Info     Mean      Gmd      .05      .10 
     338        0       58    0.999    67.01    13.74    47.00    50.70 
     .25      .50      .75      .90      .95 
   59.25    68.00    76.00    83.00    85.00 

lowest : 26 28 29 33 34, highest: 87 88 89 90 91
Empty data.table (0 rows and 1 cols): gender
# Compute mean and median age by gender
d[, .(Mean=mean(age), Median=median(age)), by=gender]
   gender     Mean Median
1:   male 67.85909     69
2: female 67.00888     68
# To create a new subset
w <- d[gender == 'female' & age < 70, ]

10.2 Adding, Changing, and Removing Variables

With data.table you can create a new data table with added variables, or you can add or redefine variables in a data table in place. The latter has major speed and memory efficiency implications when processing massive data tables. Here d refers to a different data table from the one used above.

# Rename a variable
setnames(d, c('gender', 'height'),
            c(  'sex',      'ht'))
# Easier:
setnames(d, .q(gender, height),
            .q(   sex,     ht))

# Or
ren <- .q(gender=sex, height=ht)
setnames(d, names(ren), ren)

# Or
rename <- function(x, n) setnames(x, names(n), n)
rename(d, .q(gender=sex, height=ht))

# For all variables having baseline_ at the start of their names, remove it
n <- names(d)
setnames(d, n, sub('^baseline_', '', n))   # ^ = at start; use $ for at end

# For all variables having baseline_ at the start of their names, remove it
# and add Baseline to start of variable label
n    <- names(d)
n    <- n[grepl('^baseline_', n)]
ren  <- sub('^baseline_', '', n); names(ren) <- n
# Fetch vector of labels for variables whose names start with baseline_
labs <- sapply(d, label)[n]   # label() result is "" if no label
labs <- paste('Baseline', labs)
d    <- updata(d, rename=ren, labels=labs)

# Change all names to lower case
n <- names(d)
setnames(d, n, tolower(n))

# Abbreviate names of all variables longer than 10 characters
# abbreviate() will ensure that all names are unique
n <- names(d)
setnames(d, n, abbreviate(n, minlength=10))

# For any variables having [...] or (...) in their labels, assume these
# are units of measurement and move them from the label to the units
# attribute
d <- upData(d, moveUnits=TRUE)

# Compute two new variables, storing result in a new data table
z <- d[, .(bmi=wt / ht ^ 2, bsa=0.016667 * sqrt(wt * ht))]

# Add one new variable in place
d[, bmi := wt / ht ^ 2]

# Add two new variables in place
d[, `:=`(bmi = wt / ht ^ 2, bsa=0.016667 * sqrt(wt * ht))]
d[, .q(bmi, bsa) := .(wt / ht ^ 2, 0.016667 * sqrt(wt * ht))]

# Add two new variables whose names are specified in a vector v
v <- .q(var1, var2)   # or c('var1', 'var2')
d[, (v) := .(x1 * x2, sqrt(x2))]

# Add a new variable computed from a function of several
# variables whose names are in character vector v
d[, newvar := do.call('min', .SD), .SDcols=v]

# Compute something requiring a different formula for different types
# of observations
d[, htAdj := ifelse(sex == 'male', ht, ht * 1.07)]  # better" use fifelse in data.table
d[, htAdj := ht * ifelse(sex == 'male', 1, 1.07)]
d[, htAdj := (sex == 'male') * ht + (sex == 'female') * ht * 1.07]
d[sex == 'male',   htAdj := ht]
d[sex == 'female', htAdj := ht * 1.07]
d[, htAdj := fcase(sex == 'male',   ht,          # fcase is in dta.table
                   sex == 'female', ht * 1.07)]
d[, htAdj := fcase(sex = 'female', ht * 1.07, default = ht)]

# Add label & optional units (better to use upData which works on data tables)
adlab <- function(x, lab, un='') {
  label(x) <- lab
  if(un != '') units(x) <- un
  x
}
d[, maxhr := adlab(maxhr, 'Maximum Heart Rate', '/m')]

# Delete a variable (or use upData)
d[, bsa := NULL]

# Delete two variables
d[, `:=`(bsa=NULL, bmi=NULL)]
d[, .q(bsa, bmi) := NULL]

# Delete two variables whose names are in character vector v
d[, (v) := NULL]

# Delete a variable, but not in place (e.g, just print narrower table)
d[, ! 'bsa']

See this for a comprehensive description of ways to remove columns using data.table.

Beginning with version 1.14.7 of data.table there is a new alias for := that is especially useful when assigning or deleting multiple variables:

d[, let(bmi = wt / ht ^ 2, bsa=0.016667 * sqrt(wt * ht))]

10.3 Recoding Variables

# Group levels a1, a2 as a & b1,b2,b3 as b
d[, x := factor(x, .q(a1,a2,b1,b2,b3),
                   .q( a, a, b, b, b))]
# Regroup an existing factor variable
levels(d$x) <- list(a=.q(a1,a2), b=.q(b1,b2,b3))
# or
d <- upData(d, levels=list(x=list(a=.q(a1,a2), b=.q(b1,b2,b3))))
# Or manipulate character strings
d[, x := substring(x, 1, 1)]   # replace x with first character of levels
# or
levels(d$x) <- substring(levels(d$x), 1, 1)

# Recode a numeric variable with values 0, 1, 2, 3, 4 to 0, 1, 1, 1, 2
d[, x := 1 * (x %in% 1:3) + 2 * (x == 4)]
d[, x := fcase(x %in% 1:3, 1,          # fcase is in data.table
               x == 4,     2)]
d[, x := fcase(x %between% c(1,3), 1,
               x    ==       4,    2)]  # %between% is in data.table

# Recode a series of conditions to a factor variable whose value is taken
# from the last condition that is TRUE using Hmisc::score.binary
# Result is a factor variable unless you add retfactor=FALSE
d[, x := score.binary(x1 == 'symptomatic',
                      x2 %in% .q(stroke, MI),
                      death)]
# Same but code with numeric points
d[, x := score.binary(x1 == 'symptomatic',
                      x2 %in% .q(stroke, MI),
                      death,  # TRUE/FALSE or 1/0 variable
                      points=c(1,2,10))]
# Or just reverse the conditions and use fcase which stops at the first
# condition met
d[, x := fcase(death,                  'death',        # takes precedence
               x2 %in% .q(stroke, MI), 'stroke/MI',    # takes next precedence
               x1 == 'symptomatic',    'symptomatic',
               default =               'none')]

# Recode from one set of character strings to another using named vector
# A named vector can be subscripted with character strings as well as integers
states <- c(AL='Alabama', AK='Alaska', AZ='Arizona', ...)
# Could also do:
#  states <- .q(AL=Alabama, AK=Alaska, AZ=Arizona, ..., NM='New Mexico', ...)
# or do a merge for table lookup (see later)
d[, State := states[state]]
# states are unique, state can have duplicates and all are recoded
d[, State := fcase(state == 'AL', 'Alabama',  state='AK', 'Alaska', ...)]

# Recode from integers 1, 2, ..., to character strings
labs <- .q(elephant, giraffe, dog, cat)
d[, x := labs[x]]

# Recode from character strings to integers
d[, x := match(x, labs)]
d[, x := fcase(x == 'elephant', 1,
               x == 'giraffe',  2,
               x == 'dog',      3,
               x == 'cat',      4)]

As an example of more complex hierarchical recoding let’s define codes in a nested list.

a <- list(plant =
            list(vegetable = .q(spinach, lettuce, potato),
                 fruit     = .q(apple, orange, banana, pear)),
          animal =
            list(domestic  = .q(dog, cat, horse),
                 wild      = .q(giraffe, elephant, lion, tiger)) )
a
$plant
$plant$vegetable
[1] "spinach" "lettuce" "potato" 

$plant$fruit
[1] "apple"  "orange" "banana" "pear"  


$animal
$animal$domestic
[1] "dog"   "cat"   "horse"

$animal$wild
[1] "giraffe"  "elephant" "lion"     "tiger"   
a <- unlist(a)
a
plant.vegetable1 plant.vegetable2 plant.vegetable3     plant.fruit1 
       "spinach"        "lettuce"         "potato"          "apple" 
    plant.fruit2     plant.fruit3     plant.fruit4 animal.domestic1 
        "orange"         "banana"           "pear"            "dog" 
animal.domestic2 animal.domestic3     animal.wild1     animal.wild2 
           "cat"          "horse"        "giraffe"       "elephant" 
    animal.wild3     animal.wild4 
          "lion"          "tiger" 
# Pick names of unlist'ed elements apart to define kingdom and type
n <- sub('[0-9]*$', '', names(a))  # remove sequence numbers from ends of names
# Names are of the form kingdom.type; split at .
s       <- strsplit(n, '.', fixed=TRUE)
kingdom <- sapply(s, function(x) x[1])
type    <- sapply(s, function(x) x[2])
# or:   (note \\. is escaped . meaning not to use as wild card)
#        .* = wild card: any number of characters
# kingdom <- sub('\\..*', '', n)  # in xxx.yyy remove .yyy
# type    <- sub('.*\\.', '', n)  # in xxx.yyy remove xxx.
names(kingdom) <- names(type) <- a
w <- data.table(kingdom, type, item=a, key=c('kingdom', 'item'))
w
    kingdom      type     item
 1:  animal  domestic      cat
 2:  animal  domestic      dog
 3:  animal      wild elephant
 4:  animal      wild  giraffe
 5:  animal  domestic    horse
 6:  animal      wild     lion
 7:  animal      wild    tiger
 8:   plant     fruit    apple
 9:   plant     fruit   banana
10:   plant vegetable  lettuce
11:   plant     fruit   orange
12:   plant     fruit     pear
13:   plant vegetable   potato
14:   plant vegetable  spinach
# Example table lookups
cat(kingdom['dog'], ':', type['dog'], '\n')
animal : domestic 
kingdom[.q(dog, cat, spinach)]
     dog      cat  spinach 
"animal" "animal"  "plant" 
type   [.q(dog, cat, giraffe, spinach)]
        dog         cat     giraffe     spinach 
 "domestic"  "domestic"      "wild" "vegetable" 
# But what if there is a plant named the same as an animal?
# Then look up on two keys
w[.('animal', 'lion'), type]
[1] "wild"

10.3.1 Recoding From an Expression File

In a project involving hundreds of variables and a team of analysts and programmers, specification and debugging of code that computes derived variables can be hard to manage, and the programming required may be tedious. It is useful to separate derived variable computations from overall data processing code and to manage long lists of formulas in Github or other repositories. Properly constructed, the code inside such a file can be streamlined, easier to read, and easy to apply to data tables.

Consider the case where each derived variable involves only variables from one dataset. An R list object is a good structure for specification of formulas, new variable names, and labels. Consider this R structure for a single derived variable to be named newname. The Hmisc .q function is used to quote variable names in the drop option. The vector of variable names in drop specifies component variables to be dropped from the data table upon computation of newname. drop= is omitted if no variables are to be dropped. label= and/or units= are omitted if you don’t give newname a label or units of measurement.

list(newname=expression(an R expression of one or more variables),
     label='label for newname', units='units of measurement',
     drop=.q(var1, var2, var3))

The expression to compute must be the first element of the list and it must be named with the new variable. An existing variable name may be used instead, resulting in transformation and replacement of that variable in the data table.

Here is an example in which body mass index is computed.

list(bmi = expression(703 * weight / height ^ 2),
     label='Body Mass Index',
     drop=.q(weight, height), units='Kg/m^2')

The flexibility of R comes into play once you consider that complex derived variable expressions can use external functions that you store elsewhere.

For multiple derived variables these lists are stacked, and surrounded by an overall list that contains all these variable-specific lists. When dealing with multiple datasets, these overall lists can be combined into a mega list. An example of a Github file containing such specification is here, printed below. Ignore the updata object for now.

# Define any complex derivations or repetitively used code as functions here
# These functions may be called as needed from code inside lists inside
# the derv object below
# Example: (not very complex but you'll get the idea)
# myfun <- function(x1, x2) x1 > 3 | (x1 < 1 & x2 < 0)
# ...
# list(x = expression(myfun(x1, x2)))
# myfun is defined first, when this entire file is source()'d


derv <- list(  # mega list for all datasets
  # Derived variables for dataset A
  A = list(
    list(bmi = expression(703 * weight / height ^ 2),
     label='Body Mass Index',
     units='Kg/m^2')
  ),
  
  # Derived variables for dataset B, and transformation of existing var z
  B = list(
    list(x = expression(x1 * x2 / x3),
         label = 'X', drop=.q(x1, x2, x3)),
    list(y = expression(y1 / y2),
         label = 'Y', drop=.q(y1, y2)),
    list(z = expression(z * 1000))
  )
)

# Another approach using upData() instead of runDeriveExpr()

updata <- list(
  # upData input for dataset A
  A = list(
    bmi = expression(703 * weight / height ^ 2),
    labels = c(bmi='Body Mass Index'),
    units  = c('Kg/m^2')
  ),
  B = list(
    # upData input for dataset B
    x = expression(x1 * x2 / x3),
    y = expression(y1 / y2),
    z = expression(z * 1000),
    labels = .q(x=X, y=Y),
    drop   = .q(x1, x2, x3, y1, y2)
  )
)

# To syntax-check code in this file, just source() it into any R session
# after running require(Hmisc) (to define .q if you use it)

The following code reads the mega list, executing the code to create the needed definitions and storing the overall derivation information in object derv.

derv was the object name to which the mega list was assigned inside the external derivation file.
source(file)
derv
$A
$A[[1]]
$A[[1]]$bmi
expression(703 * weight/height^2)

$A[[1]]$label
[1] "Body Mass Index"

$A[[1]]$units
[1] "Kg/m^2"



$B
$B[[1]]
$B[[1]]$x
expression(x1 * x2/x3)

$B[[1]]$label
[1] "X"

$B[[1]]$drop
[1] "x1" "x2" "x3"


$B[[2]]
$B[[2]]$y
expression(y1/y2)

$B[[2]]$label
[1] "Y"

$B[[2]]$drop
[1] "y1" "y2"


$B[[3]]
$B[[3]]$z
expression(z * 1000)

The reptools repository has a function runDeriveExpr that runs the items in derv from a single dataset against that dataset using data.table to effect the specified changes. The input datasets must be data.tables. By default, runDeriveExpr prints to the console information about the changes. Let’s create data tables A and B to show how this works.

getRs('reptools.r')   # defines runDeriveExpr
A <- data.table(id=1:2, height=c(70, 50), weight=c(200, 170))
A
   id height weight
1:  1     70    200
2:  2     50    170
B <- data.table(id=1:2, x1=11:12, x2=21:22, x3=31:32,
                y1=41:42, y2=51:52, z=1:2)
runDeriveExpr(A, derv$A)
Derived variable bmi added with label units
contents(A)

Data frame:A    2 observations and 4 variables    Maximum # NAs:0

                Labels  Units   Class Storage
id                                    integer
height                                 double
weight                                 double
bmi    Body Mass Index Kg/m^2 numeric  double
print(A)
   id height weight      bmi
1:  1     70    200 28.69388
2:  2     50    170 47.80400
runDeriveExpr(B, derv$B)
Derived variable x added with label; dropped variables x1, x2, x3
Derived variable y added with label; dropped variables y1, y2
Existing variable z changed
contents(B)

Data frame:B    2 observations and 4 variables    Maximum # NAs:0


   Labels   Class Storage
id                integer
z                  double
x       X numeric  double
y       Y numeric  double
print(B)
   id    z        x         y
1:  1 1000 7.451613 0.8039216
2:  2 2000 8.250000 0.8076923

Note that runDeriveExpr does the data table modifications in-place on its first argument, so one does not assign the result of the function to an object. If you want to keep the original data table intact, do something like Aorig <- copy(A) before running runDeriveExpr.

If one only needed to assign labels and units, this would best be done with the Hmisc upData function. upData can also be used to transform existing variables and to add new variables. An upData approach to the whole process involves defining the upData information as a list in the external file as was also done in testderived.r.

There is one small change to the usual upData input: formulas must be enclosed in expression().
# updata object was defined by source(file) above
A <- data.table(id=1:2, height=c(70, 50), weight=c(200, 170))
B <- data.table(id=1:2, x1=11:12, x2=21:22, x3=31:32,
                y1=41:42, y2=51:52, z=1:2)
A <- do.call('upData', c(list(object=A), updata$A))
Input object size:   1584 bytes;     3 variables     2 observations
Added variable      bmi
New object size:    2216 bytes; 4 variables 2 observations
B <- do.call('upData', c(list(object=B), updata$B))
Input object size:   2336 bytes;     7 variables     2 observations
Added variable      x
Added variable      y
Modified variable   z
Dropped variables   x1,x2,x3,y1,y2
New object size:    2792 bytes; 4 variables 2 observations
A
   id height weight      bmi
1:  1     70    200 28.69388
2:  2     50    170 47.80400
B
   id    z        x         y
1:  1 1000 7.451613 0.8039216
2:  2 2000 8.250000 0.8076923

Note that any arguments to upData may appear in the lists that defined the updata object.

10.4 Operations on Multiple Data Tables

Consider an example where two data tables have almost the same columns, and rows are identified uniquely using the id variable. We wish to use non-NA values in the second data table to fill in NA values in the corresponding positions in the first data table. The solution is simple, using the data.table rbind function to combine the rows, then for every variable in the combined data table taking the first non-NA value for a given id.

a <- data.table(id = c(1, 2, 3), v1 = c("w", "x", NA), v2 = c("a", NA, "c"),
                  v3 = 7:9)
b <- data.table(id = c(2, 3, 4), v1 = c(NA, "y", "z"), v2 = c("b", "c", NA))
w <- rbind(a, b, fill=TRUE)  # fill creates NAs for b's v3
a
   id   v1   v2 v3
1:  1    w    a  7
2:  2    x <NA>  8
3:  3 <NA>    c  9
b
   id   v1   v2
1:  2 <NA>    b
2:  3    y    c
3:  4    z <NA>
w
   id   v1   v2 v3
1:  1    w    a  7
2:  2    x <NA>  8
3:  3 <NA>    c  9
4:  2 <NA>    b NA
5:  3    y    c NA
6:  4    z <NA> NA
g <- function(x) x[! is.na(x)][1]  # first non-NA value in vector x
u <- w[,             lapply(.SD, g), by = id]
#                    |      |    |        |
#   for every variable      |    |        |
#  in current id's data table    |        |
#               run the g function        |
#               id variable defining groups
u
   id v1   v2 v3
1:  1  w    a  7
2:  2  x    b  8
3:  3  y    c  9
4:  4  z <NA> NA

10.5 Reshaping Data

To reshape data from long to wide format, take the L data table above as an example.

L <- data.table(id  = c(2, 2, 2, 3, 3, 3, 3, 4, 4, 5, 5, 5),
                day = c(1, 2, 3, 1, 2, 3, 4, 1, 2, 1, 2, 3),
                y    =  1 : 12, key='id')
L
    id day  y
 1:  2   1  1
 2:  2   2  2
 3:  2   3  3
 4:  3   1  4
 5:  3   2  5
 6:  3   3  6
 7:  3   4  7
 8:  4   1  8
 9:  4   2  9
10:  5   1 10
11:  5   2 11
12:  5   3 12
w <- dcast(L, id ~ day, value.var='y')
w
   id  1  2  3  4
1:  2  1  2  3 NA
2:  3  4  5  6  7
3:  4  8  9 NA NA
4:  5 10 11 12 NA
# Now reverse the procedure  (see also meltData in reptools)
m <- melt(w, id.vars='id', variable.name='day', value.name='y')
m
    id day  y
 1:  2   1  1
 2:  3   1  4
 3:  4   1  8
 4:  5   1 10
 5:  2   2  2
 6:  3   2  5
 7:  4   2  9
 8:  5   2 11
 9:  2   3  3
10:  3   3  6
11:  4   3 NA
12:  5   3 12
13:  2   4 NA
14:  3   4  7
15:  4   4 NA
16:  5   4 NA
setkey(m, id, day)   # sorts
m[! is.na(y)]
    id day  y
 1:  2   1  1
 2:  2   2  2
 3:  2   3  3
 4:  3   1  4
 5:  3   2  5
 6:  3   3  6
 7:  3   4  7
 8:  4   1  8
 9:  4   2  9
10:  5   1 10
11:  5   2 11
12:  5   3 12

10.6 Text Analysis

Here is a type of “short and narrow to long and more narrow” data table conversion to split text into individual words. Consider a data table with a variable x that is a character string containing any number of words separated by spaces or commas (commas are not actually used in this example). Use the data.table tstrsplit function to split the string by these delimeters to form multiple rows. The original row numbers (line numbers) are preserved as a column named line. This is the same operation as done by the tidytext unnest_tokens function and the corpus text_split function. We will use the corpus approach later.

text <- c("Because I could not stop for Death -",
          "He kindly stopped for me -",
          "The Carriage held but just Ourselves -",
          "and Immortality")     # from a poem by Emily Dickinson
d <- data.table(x=text)
w <- d[, tstrsplit(x, split=' |,')]   # | is "or" in regular expressions
w[, line := 1 : .N]              # add original line numbers
w                                # tstrsplit is in data.table
        V1          V2      V3   V4   V5        V6    V7   V8 line
1: Because           I   could  not stop       for Death    -    1
2:      He      kindly stopped  for   me         -  <NA> <NA>    2
3:     The    Carriage    held  but just Ourselves     - <NA>    3
4:     and Immortality    <NA> <NA> <NA>      <NA>  <NA> <NA>    4
u <- melt(w, id.vars='line', value.name='word')
u <- u[! is.na(word),]           # remove padding for short sentences
u
    line variable        word
 1:    1       V1     Because
 2:    2       V1          He
 3:    3       V1         The
 4:    4       V1         and
 5:    1       V2           I
 6:    2       V2      kindly
 7:    3       V2    Carriage
 8:    4       V2 Immortality
 9:    1       V3       could
10:    2       V3     stopped
11:    3       V3        held
12:    1       V4         not
13:    2       V4         for
14:    3       V4         but
15:    1       V5        stop
16:    2       V5          me
17:    3       V5        just
18:    1       V6         for
19:    2       V6           -
20:    3       V6   Ourselves
21:    1       V7       Death
22:    3       V7           -
23:    1       V8           -
    line variable        word
u <- u[, .q(line, wordno, variable, word) :=
          .(line, as.integer(substring(variable, 2)), NULL, word)]
setkey(u, line, wordno)
print(u)                         # don't know why u without print() doesn't print
    line        word wordno
 1:    1     Because      1
 2:    1           I      2
 3:    1       could      3
 4:    1         not      4
 5:    1        stop      5
 6:    1         for      6
 7:    1       Death      7
 8:    1           -      8
 9:    2          He      1
10:    2      kindly      2
11:    2     stopped      3
12:    2         for      4
13:    2          me      5
14:    2           -      6
15:    3         The      1
16:    3    Carriage      2
17:    3        held      3
18:    3         but      4
19:    3        just      5
20:    3   Ourselves      6
21:    3           -      7
22:    4         and      1
23:    4 Immortality      2
    line        word wordno

The work can be done with a single line of code if you use the splitstackshape package’s cSplit function. Or use the following intuitive function, which requires only simple functions.

unpackstr <- function(x, split=' |,|;|:') {
  s <- strsplit(x, split=split)
  s <- lapply(s, function(u) list(wordno=1 : length(u), word=u))
  r <- rbindlist(s, idcol='line')   # rbindlist is in data.table; stack lists into single table
  setkey(r, line, wordno)
  r
}
unpackstr(text)
    line wordno        word
 1:    1      1     Because
 2:    1      2           I
 3:    1      3       could
 4:    1      4         not
 5:    1      5        stop
 6:    1      6         for
 7:    1      7       Death
 8:    1      8           -
 9:    2      1          He
10:    2      2      kindly
11:    2      3     stopped
12:    2      4         for
13:    2      5          me
14:    2      6           -
15:    3      1         The
16:    3      2    Carriage
17:    3      3        held
18:    3      4         but
19:    3      5        just
20:    3      6   Ourselves
21:    3      7           -
22:    4      1         and
23:    4      2 Immortality
    line wordno        word

As seen above, tokenizing words in text strings is not that difficult. Tokenizing sentences is harder due to the presence of abbreviations that contain periods. The corpus package handles this nicely, including support for multiple languages. Let’s do some text processing on Barack Obama’s 2016 State of the Union Speech provided by Taylor Arnold and Lauren Tilton in their text analysis tutorial. Paragraphs are delimited by blank lines, and there can be multiple sentences within a paragraph. Use corpus’s text_split function to tokenize sentences, putting each sentence into a separate line. Then the same function is used on the tokenized sentences to tokenize words. This allows us to link a word back to a particular sentence using the parent variable created by text_split.

text <- readLines('https://raw.githubusercontent.com/programminghistorian/jekyll/gh-pages/assets/basic-text-processing-in-r/sotu_text/236.txt')
length(text)
[1] 140
# Change all HTML &mdash; to actual dash
s <- gsub('&mdash;', ' - ', text)
# Remove laughter which is marked by [laughter] and [Laughter]
# Must escape [] in regular expressions using \\
# [Ll] means any element within [], i.e., either upper or lower case L
s <- gsub('\\[[Ll]aughter\\]', ' ', s)

# Tokenize
require(corpus)
# Get overall statistics on combined text lines, ignoring empty lines
sc <- paste(s, collapse=' ')
text_stats(sc)
  tokens types sentences
1   6802  1603       359
# Summarize paragraphs, making sure that blank lines separating them
# are ignoreod
text_stats(setdiff(s, ''))
   tokens types sentences
1      83    60         6
2     100    73         4
3     114    81         6
4      45    28         3
5      96    60         5
6     163   101         7
7      54    41         2
8      76    48         5
9      79    52         5
10    117    82         5
11    112    78         6
12     57    41         2
13     70    52         5
14     83    59         3
15     55    45         3
16    108    81         4
17     96    65         7
18    143    89         6
19    129    91         8
20    147    87         8
⋮  (70 rows total)
# Summarize 4-word combinations ignoring punctuation
term_stats(s, ngrams=4, filter=text_filter(drop_punct=TRUE))
   term                     count support
1  i see it in                  6       4
2  see it in the                6       4
3  right thing to do            4       4
4  the right thing to           4       4
5  the united states of         4       4
6  united states of america     4       4
7  in this new economy          4       3
8  it's the right thing         3       3
9  the american people and      3       3
10 when it comes to             3       3
11 a country have to            2       2
12 and a lot of                 2       2
13 as a country have            2       2
14 brings me to the             2       2
15 country have to answer       2       2
16 have the final word          2       2
17 in favor of the              2       2
18 it's a problem and           2       2
19 keep america safe and        2       2
20 on the assembly line         2       2
⋮  (5761 rows total)
# Find all instances = 'military'
text_locate(s, 'military')
  text             before              instance              after              
1 59   …ecause you'll be debating our  military , most of America's business le…
2 67   …n close. We spend more on our  military  than the next eight nations co…
3 77   …e world, authorize the use of  military  force against ISIL. Take a vot…
4 85   … of Ebola in West Africa. Our  military , our doctors, our development …
5 91   …p means a wise application of  military  power and rallying the world b…
# Count occurrences of 'the best'
text_count(sc, 'the best')
[1] 3
# Break into sentences, recognizing abbreviations' periods
sen <- text_split(s, units='sentences')
sen
   parent index text                                                            
1  1          1 Thank you.                                                      
2  1          2 Mr. Speaker, Mr. Vice President, Members of Congress, my fellow…
3  1          3 And for this final one, I'm going to try to make it a little sh…
4  1          4 I know some of you are antsy to get back to Iowa.               
5  1          5 I've been there.                                                
6  1          6 I'll be shaking hands afterwards if you want some tips.         
7  2          1                                                                 
8  3          1 Now, I understand that because it's an election season, expecta…
9  3          2 But, Mr. Speaker, I appreciate the constructive approach that y…
10 3          3 So I hope we can work together this year on some bipartisan pri…
11 3          4 So, who knows, we might surprise the cynics again.              
12 4          1                                                                 
13 5          1 But tonight I want to go easy on the traditional list of propos…
14 5          2 Don't worry, I've got plenty -   - from helping students learn …
15 5          3 And I will keep pushing for progress on the work that I believe…
16 5          4 All these things still matter to hard-working families.         
17 5          5 They're still the right thing to do.                            
18 5          6 And I won't let up until they get done.                         
19 6          1                                                                 
20 7          1 But for my final address to this Chamber, I don't want to just …
⋮  (429 rows total)
dim(sen)
[1] 429   3
# Compute number of tokens per sentence
tps <- text_stats(sen)
# Compute frequency distribution of sentence lengths
table(tps$tokens)

 0  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 
70  4  6 15 11  5 15 15 15 13 11 18 13 14 13 13  9 13 13  6 13 13  5  5  4  9 
27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 45 46 47 48 49 50 52 57 
11  6  5  6  6  6  3  2  4  5  2  4  2  1  2  1  3  6  4  2  2  1  2  1  1 
# Print longest sentence (but this does count punctuation as words)
longest_sentence <- which(tps$tokens == max(tps$tokens))
sen[longest_sentence, 'text']
[1] "But I can promise that a little over a year from now, when I no longer ho…"
s[grep('But I can promise that', s)]
[1] "It is not easy. Our brand of democracy is hard. But I can promise that a little over a year from now, when I no longer hold this office, I will be right there with you as a citizen, inspired by those voices of fairness and vision, of grit and good humor and kindness, that have helped America travel so far. Voices that help us see ourselves not, first and foremost, as Black or White or Asian or Latino, not as gay or straight, immigrant or native born, not Democrat or Republican, but as Americans first, bound by a common creed. Voices Dr. King believed would have the final word: voices of \"unarmed truth and unconditional love.\" "
words <- text_split(sen, units='tokens')
# Consider also:
# words <- text_split(sen, units='tokens', filter=text_filter(drop_punct=TRUE))
words
   parent index text     
1  1          1 Thank    
2  1          2 you      
3  1          3 .        
4  2          1 Mr       
5  2          2 .        
6  2          3 Speaker  
7  2          4 ,        
8  2          5 Mr       
9  2          6 .        
10 2          7 Vice     
11 2          8 President
12 2          9 ,        
13 2         10 Members  
14 2         11 of       
15 2         12 Congress 
16 2         13 ,        
17 2         14 my       
18 2         15 fellow   
19 2         16 Americans
20 2         17 :        
⋮  (6872 rows total)
length(unique(words$parent))   # same as number of sentences
[1] 429
# Convert to data.table.  as.data.frame and as.data.table don't work
# because the text column has a special corpus class
# Trim white space from words and convert to lower case
w <- with(words,
          data.table(parent=as.integer(as.character(parent)),
                     index,
                     word=tolower(trimws(as.character(text))),
                     key=.q(parent, index)))

# Remove 'words' that contain only punctuation, and zero length words
# [[:punct:]] is the regular expression for any accepted punctuation characters
# ^ means start of line of text, $ is the end of a line
# So the overall regular expression means punctuation and nothing but
w <- w[! grepl('^[[:punct:]]+$', word) & nchar(word) > 0,]

# Show distribution of word lengths
w[, table(nchar(word))]

   1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   17 
 182 1086 1220 1145  614  572  489  293  200  138   59   24   16    5    1    1 
# Compute word frequencies
f <- w[, as.data.table(table(word))]
# Sort in descending order of frequency
f <- f[order(-N),]
# Find 10 most commonly used words, sorted in descending order of frequency
f[1:10, ]
    word   N
 1:  the 280
 2:   to 209
 3:  and 189
 4:   of 148
 5: that 125
 6:   we 124
 7:    a 120
 8:   in 105
 9:  our  96
10:   is  72
# Get frequencies of most common word pairs
# shift(x) = previous value, within sentence (parent)
# Service function g to paste two words together to but return NA if
# either is missing.  This applies to the first word of a sentence,
# which has no predecessor.
g <- function(a, b) ifelse(is.na(a) | is.na(b), NA, paste(a, b))
u <- w[, pair := g(shift(word), word), by=parent]
u
f <- u[, as.data.table(table(pair))]
f <- f[order(-N),]
f[1:10, ]
         pair  N
 1:    in the 27
 2:    of the 20
 3: the world 19
 4:   and the 15
 5:      as a 13
 6:   have to 12
 7:   want to 12
 8:    on the 11
 9:     to be 11
10:     i see 10

10.7 Fast Lookup from Disk

The fst package introduced in Section 5.5 provides fast, efficient storage of data frames and tables and instant retrieval of selected rows and columns from the binary disk file, as long as the rows are consecutive. In the following example we create a 10,000,000 row data table grouped by region and state, with variable number of observations per state. The R datasets package provides built-in variables state.name and state.division to facilitate this. Start by creating a lookup table that defines regions (division) given state names. Then create the large data table with a variable x that is repeatedly measured within each state.

division <- as.character(state.division)
names(division) <- state.name
n <- 1e7
set.seed(1)
d <- data.table(state = sample(state.name, n, replace=TRUE),
                x     = runif(n))
# Add division variable.names to data table
d[, division := division[state]]
# Sort data table by division and state within division
setkey(d, division, state)   # 0.28s

Make two index files, one for retrieval of whole divisions, and one for retrieval of one state. For each group compute the first and last row for it. Start by creating a variable rowno that has values \(1, 2, 3, ..., 10^{7}\).

d[, rowno := 1 : .N]  # 0.03s

Now we can compute starting and ending rows for each group.

idx_d <- d[, .(start=min(rowno), end=max(rowno)),
           by=.(division)]
idx_d
             division   start      end
1: East North Central       1   999941
2: East South Central  999942  1800465
3:    Middle Atlantic 1800466  2400371
4:           Mountain 2400372  3999911
5:        New England 3999912  5199749
6:            Pacific 5199750  6200453
7:     South Atlantic 6200454  7799708
8: West North Central 7799709  9199312
9: West South Central 9199313 10000000
idx_s <- d[, .(start=min(rowno), end=max(rowno)),
           by=.(division, state)]                 # 0.5s
head(idx_s)
             division     state  start     end
1: East North Central  Illinois      1  200066
2: East North Central   Indiana 200067  399511
3: East North Central  Michigan 399512  599508
4: East North Central      Ohio 599509  799516
5: East North Central Wisconsin 799517  999941
6: East South Central   Alabama 999942 1199664

Now store the data table and the two row number lookup tables for future use, using fst. Since we didn’t add labels or units to variables we don’t need to use the method in Section 5.5.

When creating a data table with 100,000,000 rows, write_fst requires 5s to write the file using standard compression, resulting in a file size of 1.3G and a read_fst run time of 6s. When using compress=100, write_fst takes 42s and creates a 564MB file, requiring 7s for read_fst. fst is multi-threaded, and timings given here are for a 12-core machine. Total run time will be longer if you have fewer cores.
require(fst)
write_fst(d,     'states.fst',    compress=100)   # 4.3s, 57MB
write_fst(idx_d, 'divlocs.fst',   compress=100)
write_fst(idx_s, 'statelocs.fst', compress=100)   # 0s, 1441 bytes

Directly access data from disk for Kentucky in the East South Central division.

# Lookup by two data.table keys
i <- idx_s[.("East South Central", "Kentucky")]
i
             division    state   start     end
1: East South Central Kentucky 1199665 1400207
# First fetch Kentucky data from the in-memory data table making use of keys
d[.("East South Central", "Kentucky")]   # 0.007s
           state          x           division   rowno
     1: Kentucky 0.57014110 East South Central 1199665
     2: Kentucky 0.77432160 East South Central 1199666
     3: Kentucky 0.35787757 East South Central 1199667
     4: Kentucky 0.54464932 East South Central 1199668
     5: Kentucky 0.36688879 East South Central 1199669
    ---                                               
200539: Kentucky 0.08608586 East South Central 1400203
200540: Kentucky 0.91762672 East South Central 1400204
200541: Kentucky 0.51089009 East South Central 1400205
200542: Kentucky 0.86051231 East South Central 1400206
200543: Kentucky 0.89353138 East South Central 1400207
# Now fetch from disk, avoiding having to have whole dataset in memory
# Lookup start and end rows by running  read_fst within the i environment
dr <- i[, read_fst('states.fst', as.data.table=TRUE,
        from=start, to=end)]  # 0.025s
dr
           state          x           division   rowno
     1: Kentucky 0.57014110 East South Central 1199665
     2: Kentucky 0.77432160 East South Central 1199666
     3: Kentucky 0.35787757 East South Central 1199667
     4: Kentucky 0.54464932 East South Central 1199668
     5: Kentucky 0.36688879 East South Central 1199669
    ---                                               
200539: Kentucky 0.08608586 East South Central 1400203
200540: Kentucky 0.91762672 East South Central 1400204
200541: Kentucky 0.51089009 East South Central 1400205
200542: Kentucky 0.86051231 East South Central 1400206
200543: Kentucky 0.89353138 East South Central 1400207

Because of efficiencies built into fst, there is very little overhead from extracting > 200,000 observations for Kentucky from disk, and the 343MB of memory for storing data table d would not be needed if processing only a subset of d.

Extract data for all east south central states.

i <- idx_d['East South Central']
i
             division  start     end
1: East South Central 999942 1800465
dr <- i[, read_fst('states.fst', as.data.table=TRUE,
        from=start, to=end)]
dr
            state          x           division   rowno
     1:   Alabama 0.19261725 East South Central  999942
     2:   Alabama 0.42661174 East South Central  999943
     3:   Alabama 0.63643172 East South Central  999944
     4:   Alabama 0.98494926 East South Central  999945
     5:   Alabama 0.18569219 East South Central  999946
    ---                                                
800520: Tennessee 0.79819564 East South Central 1800461
800521: Tennessee 0.18457323 East South Central 1800462
800522: Tennessee 0.36241596 East South Central 1800463
800523: Tennessee 0.67113880 East South Central 1800464
800524: Tennessee 0.09263445 East South Central 1800465

Instead of dealing with predefined groups, what if we wanted to retrieve from the external file all the observations satisfying conditions that are defined by values of variables in the file? read_fst is so fast that one can randomly access thousands of rows, one at a time. In the code below, the 1006 observations with values of x above 0.9999 are fetched from the 10,000,000 row data table in a total of 0.13s. These single observations are then stacked into a data table.

# Read subsetting variable(s) from whole dataset
x <- read_fst('states.fst', columns='x')$x   # 0.05s
i <- which(x > 0.9999)   # row numbers
n <- length(i)           # number of rows to fetch
n
[1] 1006
# Set aside a list with n elements, each one a data.table
w <- vector('list', n)
k <- 0
for(j in i) {    # 0.13s total
  k <- k + 1
  w[[k]] <- read_fst('states.fst', from=j, to=j)
}
# Run the data.table rbindlist function to efficiently stack
# all the data tables
u <- rbindlist(w)   # 0.002s
u
         state         x           division   rowno
   1: Illinois 0.9999242 East North Central    2949
   2: Illinois 0.9999345 East North Central   15178
   3: Illinois 0.9999093 East North Central   22448
   4: Illinois 0.9999033 East North Central   23355
   5: Illinois 0.9999132 East North Central   38324
  ---                                              
1002:    Texas 0.9999857 West South Central 9942802
1003:    Texas 0.9999624 West South Central 9951197
1004:    Texas 0.9999604 West South Central 9961419
1005:    Texas 0.9999215 West South Central 9969919
1006:    Texas 0.9999605 West South Central 9984511

If you want to subset a fst file on multiple variables within the file, here is a prototypical example. Assume that w.fst contains a data table.

u <- read_fst('w.fst', columns=.q(age, sex), as.data.table=TRUE)
i <- u[, which(sex == 'male' & age > 70)]
# Or use a more complex single expression:
# i <- read_fst('w.fst', columns=.q(age, sex), as.data.table=TRUE)[
#      , which(sex == 'male' & age > 70)]
n <- length(i)
w <- vector('list', n)
k <- 0
for(j in i) {
  k <- k + 1
  w[[k]] <- read_fst('w.fst', from=j, to=j)
  # Speed up by adding columns= if want to retrieve a minority
  # of variables
}
u <- rbindlist(w)
# Remove external files if no longer needed
invisible(file.remove('states.fst', 'divlocs.fst', 'statelocs.fst'))