12  Merging Data

flowchart LR
SM[Merge One Record/Subject Data Table With Longitudinal Data]
HM[Hmisc::Merge<br>Monitors Merging Process]
tlu[Merging for Table Look-up]
close[Merging on Closest Matches]

Consider a baseline dataset b and a longitudinal dataset L, with subject ID of id.

For more information see this, this, this, this and this. To merge any number of datasets at once and obtain a printed report of how the merge went, use the Hmisc Merge function.
require(Hmisc)
require(data.table)
b <- data.table(id=1:4, age=c(21, 28, 32, 23), key='id')
L <- data.table(id  = c(2, 2, 2, 3, 3, 3, 3, 4, 4, 5, 5, 5),
                day = c(1, 2, 3, 1, 2, 3, 4, 1, 2, 1, 2, 3),
                y    =  1 : 12, key='id')
b
   id age
1:  1  21
2:  2  28
3:  3  32
4:  4  23
L
    id day  y
 1:  2   1  1
 2:  2   2  2
 3:  2   3  3
 4:  3   1  4
 5:  3   2  5
 6:  3   3  6
 7:  3   4  7
 8:  4   1  8
 9:  4   2  9
10:  5   1 10
11:  5   2 11
12:  5   3 12
# Merge b and L to look up baseline age and associate it with all follow-ups
b[L, on=.(id)]   # Keeps all ids in L (left inner join)
    id age day  y
 1:  2  28   1  1
 2:  2  28   2  2
 3:  2  28   3  3
 4:  3  32   1  4
 5:  3  32   2  5
 6:  3  32   3  6
 7:  3  32   4  7
 8:  4  23   1  8
 9:  4  23   2  9
10:  5  NA   1 10
11:  5  NA   2 11
12:  5  NA   3 12
L[b, on=.(id)]   # Keeps all ids in b (right inner join)
    id day  y age
 1:  1  NA NA  21
 2:  2   1  1  28
 3:  2   2  2  28
 4:  2   3  3  28
 5:  3   1  4  32
 6:  3   2  5  32
 7:  3   3  6  32
 8:  3   4  7  32
 9:  4   1  8  23
10:  4   2  9  23
L[b, on=.(id), nomatch=NULL]  # Keeps only ids in both b and L (right outer join)
   id day y age
1:  2   1 1  28
2:  2   2 2  28
3:  2   3 3  28
4:  3   1 4  32
5:  3   2 5  32
6:  3   3 6  32
7:  3   4 7  32
8:  4   1 8  23
9:  4   2 9  23
uid <- unique(c(b[, id], L[, id]))
L[b[.(uid), on=.(id)]]         # Keeps ids in either b or c (full outer join)
    id day  y age
 1:  1  NA NA  21
 2:  2   1  1  28
 3:  2   2  2  28
 4:  2   3  3  28
 5:  3   1  4  32
 6:  3   2  5  32
 7:  3   3  6  32
 8:  3   4  7  32
 9:  4   1  8  23
10:  4   2  9  23
11:  5   1 10  NA
12:  5   2 11  NA
13:  5   3 12  NA
merge(b, L, by='id', all=TRUE) # also full outer join; calls merge.data.table
    id age day  y
 1:  1  21  NA NA
 2:  2  28   1  1
 3:  2  28   2  2
 4:  2  28   3  3
 5:  3  32   1  4
 6:  3  32   2  5
 7:  3  32   3  6
 8:  3  32   4  7
 9:  4  23   1  8
10:  4  23   2  9
11:  5  NA   1 10
12:  5  NA   2 11
13:  5  NA   3 12
# Use the Hmisc Merge function which reports on how the merge went
# Merge operates on both data.tables and data.frames and can handle
# any number of them.  In details printed by Merge below, #1
# refers to the first data table (b).
Merge(b, L, id= ~ id)
       Vars Obs Unique IDs IDs in #1 IDs not in #1
b         2   4          4        NA            NA
L         3  12          4         3             1
Merged    4  13          5         4             1

Number of unique IDs in any data frame : 5 
Number of unique IDs in all data frames: 3 
    id age day  y
 1:  1  21  NA NA
 2:  2  28   1  1
 3:  2  28   2  2
 4:  2  28   3  3
 5:  3  32   1  4
 6:  3  32   2  5
 7:  3  32   3  6
 8:  3  32   4  7
 9:  4  23   1  8
10:  4  23   2  9
11:  5  NA   1 10
12:  5  NA   2 11
13:  5  NA   3 12

To help in remembering the order of data tables when joining them, think about how R does subscripting. a[b] for vectors a and b results in length(b) observations when b is a vector of positive integers or is a character vector corresponding to names(a). Likewise, for two data tables a, b, a[b] results in a data table with the number of rows equal to the number of rows in b.

For very large data tables, giving the data tables keys will speed execution, e.g.:

setkey(d, id)
setkey(d, state, city)

Join/merge can be used for data lookups:

s <- data.table(st=.q(AL, AK, AZ, CA, OK), y=5:1)
stateAbbrevs <- data.table(state=state.abb, State=state.name)
s[stateAbbrevs, , on=.(st=state), nomatch=NULL]
   st y      State
1: AL 5    Alabama
2: AK 4     Alaska
3: AZ 3    Arizona
4: CA 2 California
5: OK 1   Oklahoma

12.1 Lookup Participant Disposition

In most situations it is useful to permanently hold certain subject-level variables inside the primary analysis file. In others it is more useful to have a separate dataset that defines categories to which a subject belongs, and to look up those categories by matching on ID. This is particularly useful when there are many analysis tables and we wish to defer calculation of some of the derived variables. The following example shows how to do this.

Suppose that multiple clinical trials are being run together, and some participants may be part of more than one trial. A participant disposition data table pd will have one row per trial in which the participant is taking part. In a given row the participant’s treatment assignment is defined (variable tx) and we also designate whether the participant belongs to the modified intent-to-treat (mitt) sample. For analysis data table d for trial 1 we want to look up its participants’ treatment assignments. Then we look up their assignments but make tx equal to NA if mitt is FALSE.

# Create pd data table
pd <- data.table(id   = .q(a, a, b, c, d, e),
                 trial= .q( trial1, trial2,  trial2,  trial1, trial1, trial1),
                 tx   = .q(placebo, placebo, active, placebo, active, active),
                 mitt = c(    TRUE,    TRUE,   TRUE,   FALSE,   TRUE,  FALSE),
                 key='id')
pd
   id  trial      tx  mitt
1:  a trial1 placebo  TRUE
2:  a trial2 placebo  TRUE
3:  b trial2  active  TRUE
4:  c trial1 placebo FALSE
5:  d trial1  active  TRUE
6:  e trial1  active FALSE
# Create analysis file
d <- data.table(id=.q(a, b, d, e, f), u=c(.1, .3, .2, 0, .4), key='id')
d
   id   u
1:  a 0.1
2:  b 0.3
3:  d 0.2
4:  e 0.0
5:  f 0.4
pd[trial == 'trial1'][d]
   id  trial      tx  mitt   u
1:  a trial1 placebo  TRUE 0.1
2:  b   <NA>    <NA>    NA 0.3
3:  d trial1  active  TRUE 0.2
4:  e trial1  active FALSE 0.0
5:  f   <NA>    <NA>    NA 0.4
pd[trial == 'trial1' & mitt][d]
   id  trial      tx mitt   u
1:  a trial1 placebo TRUE 0.1
2:  b   <NA>    <NA>   NA 0.3
3:  d trial1  active TRUE 0.2
4:  e   <NA>    <NA>   NA 0.0
5:  f   <NA>    <NA>   NA 0.4
ID What Happened
a assigned to placebo and counted as mitt
b in trial2 only, so pd variables NA
d assigned to active and counted as mitt
e assigned to active when mitt ignored, to NA when required mitt
f not in pd so pd variables NA

12.2 Non-equi Joins: Closest Matches

Suppose that we wished to merge records without requiring an exact match, i.e., we want to get closest matches to records on a numeric variable (including dates, times, etc.). Consider the following example in which we create two data tables containing a numeric matching variable that is named differently in the two datasets. For data table a the matching variable is named ax and for b it is bx.

See this for nice examples of rolling joins.
s1 <- c(1, 3, 7, 12, 18, 25, 33, 42, 52)
s2 <- c(1, 4, 8,  9, 23, 27, 31, 50, 70)
a <- data.table(ax=s1, u=s1/100)
a
   ax    u
1:  1 0.01
2:  3 0.03
3:  7 0.07
4: 12 0.12
5: 18 0.18
6: 25 0.25
7: 33 0.33
8: 42 0.42
9: 52 0.52
b <- data.table(bx=s2, v=s2/100)
b
   bx    v
1:  1 0.01
2:  4 0.04
3:  8 0.08
4:  9 0.09
5: 23 0.23
6: 27 0.27
7: 31 0.31
8: 50 0.50
9: 70 0.70
# on=.(var from first data table listed (b), var from second listed (a))
# x.bx means the variable bx from the first table listed (b)
# bx=bx will not work because in that context bx has already been
# replaced with nearest ax
m <- b[a, .(ax=ax, bx=x.bx, u=u, v=v), on=.(bx=ax), roll='nearest']
# If omit .(ax=....,v=v) result only has columns bx, v, u
m
   ax bx    u    v
1:  1  1 0.01 0.01
2:  3  4 0.03 0.04
3:  7  8 0.07 0.08
4: 12  9 0.12 0.09
5: 18 23 0.18 0.23
6: 25 23 0.25 0.23
7: 33 31 0.33 0.31
8: 42 50 0.42 0.50
9: 52 50 0.52 0.50
# Drop matches that even though were closest were more than 3 apart
m[abs(ax - bx) <= 3]
   ax bx    u    v
1:  1  1 0.01 0.01
2:  3  4 0.03 0.04
3:  7  8 0.07 0.08
4: 12  9 0.12 0.09
5: 25 23 0.25 0.23
6: 33 31 0.33 0.31
7: 52 50 0.52 0.50

Alternatively we can set v to NA if the match was not close enough.

m[abs(ax - bx) > 3, v := NA]
m
   ax bx    u    v
1:  1  1 0.01 0.01
2:  3  4 0.03 0.04
3:  7  8 0.07 0.08
4: 12  9 0.12 0.09
5: 18 23 0.18   NA
6: 25 23 0.25 0.23
7: 33 31 0.33 0.31
8: 42 50 0.42   NA
9: 52 50 0.52 0.50

To check the speed of data.table for nearest matching, let’s run an example like the above on two 1,000,000 row data tables that are not sorted by the matching variables. Use the Github reptools function timeMar to put run times in the right margin.

getRs('reptools.r')
set.seed(1)
n <- 1000000
a <- data.table(u=sample(letters, n, TRUE), ax=runif(n))
b <- data.table(v=sample(LETTERS, n, TRUE), bx=runif(n))
# sapply runs the data.table function uniqueN separately on each column, and stores
# the result compactly (here in an integer vector)
# Below we see a few ties in the random uniform draws
c(sapply(a, uniqueN), sapply(b, uniqueN))
 u     ax      v     bx 
26 999867     26 999892 
m <- timeMar(b[a, .(ax=ax, bx=x.bx, u=u, v=v),
               on=.(bx=ax), roll='nearest'])
   user  system elapsed 
  0.899   0.004   0.420s
m
                   ax           bx u v
      1: 0.6479857326 0.6479863366 y T
      2: 0.0008197003 0.0008193317 d P
      3: 0.8371744368 0.8371735101 g H
      4: 0.2001195259 0.2001207869 a Y
      5: 0.2044507302 0.2044508520 b H
     ---                              
 999996: 0.7340266830 0.7340267950 z N
 999997: 0.3529686446 0.3529684865 e R
 999998: 0.3496006471 0.3496009253 h G
 999999: 0.6106638659 0.6106637185 y K
1000000: 0.6509100210 0.6509103342 t H
sapply(m, uniqueN)    # compute number of distinct values in all 4 variables
    ax     bx      u      v 
999867 555325     26     26 

The elapsed time was 0.42 seconds. From the above counts of the numbers of distinct values, an observation from b tended to be the closest match to two observations in a.

Let’s key each data table (which sorts) and check the speed again.

setkey(a, ax)   # took 0.05s
setkey(b, bx)   # " "
timeMar(b[a, .(ax=ax, bx=x.bx, u=u, v=v),
          on=.(bx=ax), roll='nearest'])
   user  system elapsed 
  0.230   0.000   0.128s

See also Section 13.3 and here for more complex non-equi joins and considerations of overlapping intervals from each of two data tables.

12.2.1 Closest Match Within Groups

What if you wanted to find the closest match on a numeric variable x but to do this separately by groups defined by a discrete variable grp? Here is an example showing how.

This is taken from here
A <- data.table(
  grp = c("a", "a", "b", "b"),
  val = 1:4,
  x = c(2.1, 2.2, 1.9, 3)
)

B <- data.table(
  x = c(2, 2.3),
  z = c("foo", "bar")
)

A
   grp val   x
1:   a   1 2.1
2:   a   2 2.2
3:   b   3 1.9
4:   b   4 3.0
B
     x   z
1: 2.0 foo
2: 2.3 bar
A[, .SD[B, on='x', roll='nearest'], by=grp]
   grp val   x   z
1:   a   1 2.0 foo
2:   a   2 2.3 bar
3:   b   3 2.0 foo
4:   b   3 2.3 bar