12 Merging Data

flowchart LR
SM[Merge One Record/Subject Data Table With Longitudinal Data]
HM[Hmisc::Merge<br>Monitors Merging Process]
tlu[Merging for Table Look-up]
close[Merging on Closest Matches]

Consider a baseline dataset b and a longitudinal dataset L, with subject ID of id.

For more information see this, this, this, this and this. See this for how to do merging with conditions. To merge any number of datasets at once and obtain a printed report of how the merge went, use the Hmisc Merge function.

require(Hmisc)
require(data.table)
b <- data.table(id=1:4, age=c(21, 28, 32, 23), key='id')
L <- data.table(id  = c(2, 2, 2, 3, 3, 3, 3, 4, 4, 5, 5, 5),
                day = c(1, 2, 3, 1, 2, 3, 4, 1, 2, 1, 2, 3),
                y    =  1 : 12, key='id')
b

Key: <id>
      id   age
   <int> <num>
1:     1    21
2:     2    28
3:     3    32
4:     4    23

Key: <id>
       id   day     y
    <num> <num> <int>
 1:     2     1     1
 2:     2     2     2
 3:     2     3     3
 4:     3     1     4
 5:     3     2     5
 6:     3     3     6
 7:     3     4     7
 8:     4     1     8
 9:     4     2     9
10:     5     1    10
11:     5     2    11
12:     5     3    12

# Merge b and L to look up baseline age and associate it with all follow-ups
b[L, on=.(id)]   # Keeps all ids in L (left inner join)

       id   age   day     y
    <int> <num> <num> <int>
 1:     2    28     1     1
 2:     2    28     2     2
 3:     2    28     3     3
 4:     3    32     1     4
 5:     3    32     2     5
 6:     3    32     3     6
 7:     3    32     4     7
 8:     4    23     1     8
 9:     4    23     2     9
10:     5    NA     1    10
11:     5    NA     2    11
12:     5    NA     3    12

L[b, on=.(id)]   # Keeps all ids in b (right inner join)

Key: <id>
       id   day     y   age
    <int> <num> <int> <num>
 1:     1    NA    NA    21
 2:     2     1     1    28
 3:     2     2     2    28
 4:     2     3     3    28
 5:     3     1     4    32
 6:     3     2     5    32
 7:     3     3     6    32
 8:     3     4     7    32
 9:     4     1     8    23
10:     4     2     9    23

L[b, on=.(id), nomatch=NULL]  # Keeps only ids in both b and L (right outer join)

Key: <id>
      id   day     y   age
   <int> <num> <int> <num>
1:     2     1     1    28
2:     2     2     2    28
3:     2     3     3    28
4:     3     1     4    32
5:     3     2     5    32
6:     3     3     6    32
7:     3     4     7    32
8:     4     1     8    23
9:     4     2     9    23

uid <- unique(c(b[, id], L[, id]))
L[b[.(uid), on=.(id)]]         # Keeps ids in either b or c (full outer join)

       id   day     y   age
    <int> <num> <int> <num>
 1:     1    NA    NA    21
 2:     2     1     1    28
 3:     2     2     2    28
 4:     2     3     3    28
 5:     3     1     4    32
 6:     3     2     5    32
 7:     3     3     6    32
 8:     3     4     7    32
 9:     4     1     8    23
10:     4     2     9    23
11:     5     1    10    NA
12:     5     2    11    NA
13:     5     3    12    NA

merge(b, L, by='id', all=TRUE) # also full outer join; calls merge.data.table

Key: <id>
       id   age   day     y
    <num> <num> <num> <int>
 1:     1    21    NA    NA
 2:     2    28     1     1
 3:     2    28     2     2
 4:     2    28     3     3
 5:     3    32     1     4
 6:     3    32     2     5
 7:     3    32     3     6
 8:     3    32     4     7
 9:     4    23     1     8
10:     4    23     2     9
11:     5    NA     1    10
12:     5    NA     2    11
13:     5    NA     3    12

# Use the Hmisc Merge function which reports on how the merge went
# Merge operates on both data.tables and data.frames and can handle
# any number of them.  In details printed by Merge below, #1
# refers to the first data table (b).
Merge(b, L, id= ~ id)

       Vars Obs Unique IDs IDs in #1 IDs not in #1
b         2   4          4        NA            NA
L         3  12          4         3             1
Merged    4  13          5         4             1

Number of unique IDs in any data frame : 5 
Number of unique IDs in all data frames: 3

Key: <id>
       id   age   day     y
    <num> <num> <num> <int>
 1:     1    21    NA    NA
 2:     2    28     1     1
 3:     2    28     2     2
 4:     2    28     3     3
 5:     3    32     1     4
 6:     3    32     2     5
 7:     3    32     3     6
 8:     3    32     4     7
 9:     4    23     1     8
10:     4    23     2     9
11:     5    NA     1    10
12:     5    NA     2    11
13:     5    NA     3    12

To help in remembering the order of data tables when joining them, think about how R does subscripting. a[b] for vectors a and b results in length(b) observations when b is a vector of positive integers or is a character vector corresponding to names(a). Likewise, for two data tables a, b, a[b] results in a data table with the number of rows equal to the number of rows in b.

For very large data tables, giving the data tables keys may speed execution, e.g.:

See Section 10.1.2 and this highly useful information, including a recommendation to be explicit with on= instead of relying on somewhat hidden keys.

setkey(d, id)
setkey(d, state, city)

Join/merge can be used for data lookups:

s <- data.table(st=.q(AL, AK, AZ, CA, OK), y=5:1)
stateAbbrevs <- data.table(state=state.abb, State=state.name)
s[stateAbbrevs, , on=.(st=state), nomatch=NULL]

       st     y      State
   <char> <int>     <char>
1:     AL     5    Alabama
2:     AK     4     Alaska
3:     AZ     3    Arizona
4:     CA     2 California
5:     OK     1   Oklahoma

12.1 Lookup Participant Disposition

In most situations it is useful to permanently hold certain subject-level variables inside the primary analysis file. In others it is more useful to have a separate dataset that defines categories to which a subject belongs, and to look up those categories by matching on ID. This is particularly useful when there are many analysis tables and we wish to defer calculation of some of the derived variables. The following example shows how to do this.

Suppose that multiple clinical trials are being run together, and some participants may be part of more than one trial. A participant disposition data table pd will have one row per trial in which the participant is taking part. In a given row the participant’s treatment assignment is defined (variable tx) and we also designate whether the participant belongs to the modified intent-to-treat (mitt) sample. For analysis data table d for trial 1 we want to look up its participants’ treatment assignments. Then we look up their assignments but make tx equal to NA if mitt is FALSE.

# Create pd data table
pd <- data.table(id   = .q(a, a, b, c, d, e),
                 trial= .q( trial1, trial2,  trial2,  trial1, trial1, trial1),
                 tx   = .q(placebo, placebo, active, placebo, active, active),
                 mitt = c(    TRUE,    TRUE,   TRUE,   FALSE,   TRUE,  FALSE),
                 key='id')
pd

Key: <id>
       id  trial      tx   mitt
   <char> <char>  <char> <lgcl>
1:      a trial1 placebo   TRUE
2:      a trial2 placebo   TRUE
3:      b trial2  active   TRUE
4:      c trial1 placebo  FALSE
5:      d trial1  active   TRUE
6:      e trial1  active  FALSE

# Create analysis file
d <- data.table(id=.q(a, b, d, e, f), u=c(.1, .3, .2, 0, .4), key='id')
d

Key: <id>
       id     u
   <char> <num>
1:      a   0.1
2:      b   0.3
3:      d   0.2
4:      e   0.0
5:      f   0.4

pd[trial == 'trial1'][d]

Key: <id>
       id  trial      tx   mitt     u
   <char> <char>  <char> <lgcl> <num>
1:      a trial1 placebo   TRUE   0.1
2:      b   <NA>    <NA>     NA   0.3
3:      d trial1  active   TRUE   0.2
4:      e trial1  active  FALSE   0.0
5:      f   <NA>    <NA>     NA   0.4

pd[trial == 'trial1' & mitt][d]

Key: <id>
       id  trial      tx   mitt     u
   <char> <char>  <char> <lgcl> <num>
1:      a trial1 placebo   TRUE   0.1
2:      b   <NA>    <NA>     NA   0.3
3:      d trial1  active   TRUE   0.2
4:      e   <NA>    <NA>     NA   0.0
5:      f   <NA>    <NA>     NA   0.4

ID	What Happened
a	assigned to `placebo` and counted as mitt
b	in `trial2` only, so `pd` variables `NA`
d	assigned to `active` and counted as mitt
e	assigned to `active` when mitt ignored, to `NA` when required mitt
f	not in `pd` so `pd` variables `NA`

12.2 Non-equi Joins: Closest Matches

Suppose that we wished to merge records without requiring an exact match, i.e., we want to get closest matches to records on a numeric variable (including dates, times, etc.). Consider the following example in which we create two data tables containing a numeric matching variable that is named differently in the two datasets. For data table a the matching variable is named ax and for b it is bx.

See this for nice examples of rolling joins.

s1 <- c(1, 3, 7, 12, 18, 25, 33, 42, 52)
s2 <- c(1, 4, 8,  9, 23, 27, 31, 50, 70)
a <- data.table(ax=s1, u=s1/100)
a

      ax     u
   <num> <num>
1:     1  0.01
2:     3  0.03
3:     7  0.07
4:    12  0.12
5:    18  0.18
6:    25  0.25
7:    33  0.33
8:    42  0.42
9:    52  0.52

b <- data.table(bx=s2, v=s2/100)
b

      bx     v
   <num> <num>
1:     1  0.01
2:     4  0.04
3:     8  0.08
4:     9  0.09
5:    23  0.23
6:    27  0.27
7:    31  0.31
8:    50  0.50
9:    70  0.70

# on=.(var from first data table listed (b), var from second listed (a))
# x.bx means the variable bx from the first table listed (b)
# bx=bx will not work because in that context bx has already been
# replaced with nearest ax
m <- b[a, .(ax=ax, bx=x.bx, u=u, v=v), on=.(bx=ax), roll='nearest']
# If omit .(ax=....,v=v) result only has columns bx, v, u
m

      ax    bx     u     v
   <num> <num> <num> <num>
1:     1     1  0.01  0.01
2:     3     4  0.03  0.04
3:     7     8  0.07  0.08
4:    12     9  0.12  0.09
5:    18    23  0.18  0.23
6:    25    23  0.25  0.23
7:    33    31  0.33  0.31
8:    42    50  0.42  0.50
9:    52    50  0.52  0.50

# Drop matches that even though were closest were more than 3 apart
m[abs(ax - bx) <= 3]

      ax    bx     u     v
   <num> <num> <num> <num>
1:     1     1  0.01  0.01
2:     3     4  0.03  0.04
3:     7     8  0.07  0.08
4:    12     9  0.12  0.09
5:    25    23  0.25  0.23
6:    33    31  0.33  0.31
7:    52    50  0.52  0.50

Alternatively we can set v to NA if the match was not close enough.

m[abs(ax - bx) > 3, v := NA]
m

      ax    bx     u     v
   <num> <num> <num> <num>
1:     1     1  0.01  0.01
2:     3     4  0.03  0.04
3:     7     8  0.07  0.08
4:    12     9  0.12  0.09
5:    18    23  0.18    NA
6:    25    23  0.25  0.23
7:    33    31  0.33  0.31
8:    42    50  0.42    NA
9:    52    50  0.52  0.50

To check the speed of data.table for nearest matching, let’s run an example like the above on two 1,000,000 row data tables that are not sorted by the matching variables. Use the qreport function timeMar to put run times in the right margin.

require(qreport)
set.seed(1)
n <- 1000000
a <- data.table(u=sample(letters, n, TRUE), ax=runif(n))
b <- data.table(v=sample(LETTERS, n, TRUE), bx=runif(n))
# sapply runs the data.table function uniqueN separately on each column, and stores
# the result compactly (here in an integer vector)
# Below we see a few ties in the random uniform draws
c(sapply(a, uniqueN), sapply(b, uniqueN))

 u     ax      v     bx 
26 999867     26 999892

m <- timeMar(b[a, .(ax=ax, bx=x.bx, u=u, v=v),
               on=.(bx=ax), roll='nearest'])

   user  system elapsed 
  1.163   0.060   0.204s

                   ax           bx      u      v
                <num>        <num> <char> <char>
      1: 0.6479857326 0.6479863366      y      T
      2: 0.0008197003 0.0008193317      d      P
      3: 0.8371744368 0.8371735101      g      H
      4: 0.2001195259 0.2001207869      a      Y
      5: 0.2044507302 0.2044508520      b      H
     ---                                        
 999996: 0.7340266830 0.7340267950      z      N
 999997: 0.3529686446 0.3529684865      e      R
 999998: 0.3496006471 0.3496009253      h      G
 999999: 0.6106638659 0.6106637185      y      K
1000000: 0.6509100210 0.6509103342      t      H

sapply(m, uniqueN)    # compute number of distinct values in all 4 variables

    ax     bx      u      v 
999867 555325     26     26

The elapsed time was 0.204 seconds. From the above counts of the numbers of distinct values, an observation from b tended to be the closest match to two observations in a.

Let’s key each data table (which sorts) and check the speed again.

setkey(a, ax)   # took 0.05s
setkey(b, bx)   # " "
timeMar(b[a, .(ax=ax, bx=x.bx, u=u, v=v),
          on=.(bx=ax), roll='nearest'])

   user  system elapsed 
  0.462   0.026   0.082s

See also Section 13.7 and here for more complex non-equi joins and considerations of overlapping intervals from each of two data tables.

12.2.1 Closest Match Within Groups

What if you wanted to find the closest match on a numeric variable x but to do this separately by groups defined by a discrete variable grp? Here is an example showing how.

This is taken from here

A <- data.table(
  grp = c("a", "a", "b", "b"),
  val = 1:4,
  x = c(2.1, 2.2, 1.9, 3)
)

B <- data.table(
  x = c(2, 2.3),
  z = c("foo", "bar")
)

A

      grp   val     x
   <char> <int> <num>
1:      a     1   2.1
2:      a     2   2.2
3:      b     3   1.9
4:      b     4   3.0

       x      z
   <num> <char>
1:   2.0    foo
2:   2.3    bar

A[, .SD[B, on='x', roll='nearest'], by=grp]

      grp   val     x      z
   <char> <int> <num> <char>
1:      a     1   2.0    foo
2:      a     2   2.3    bar
3:      b     3   2.0    foo
4:      b     3   2.3    bar