12 Merging Data
Consider a baseline dataset b
and a longitudinal dataset L
, with subject ID of id
.
require(Hmisc)
require(data.table)
<- data.table(id=1:4, age=c(21, 28, 32, 23), key='id')
b <- data.table(id = c(2, 2, 2, 3, 3, 3, 3, 4, 4, 5, 5, 5),
L day = c(1, 2, 3, 1, 2, 3, 4, 1, 2, 1, 2, 3),
y = 1 : 12, key='id')
b
Key: <id>
id age
<int> <num>
1: 1 21
2: 2 28
3: 3 32
4: 4 23
L
Key: <id>
id day y
<num> <num> <int>
1: 2 1 1
2: 2 2 2
3: 2 3 3
4: 3 1 4
5: 3 2 5
6: 3 3 6
7: 3 4 7
8: 4 1 8
9: 4 2 9
10: 5 1 10
11: 5 2 11
12: 5 3 12
# Merge b and L to look up baseline age and associate it with all follow-ups
=.(id)] # Keeps all ids in L (left inner join) b[L, on
id age day y
<int> <num> <num> <int>
1: 2 28 1 1
2: 2 28 2 2
3: 2 28 3 3
4: 3 32 1 4
5: 3 32 2 5
6: 3 32 3 6
7: 3 32 4 7
8: 4 23 1 8
9: 4 23 2 9
10: 5 NA 1 10
11: 5 NA 2 11
12: 5 NA 3 12
=.(id)] # Keeps all ids in b (right inner join) L[b, on
Key: <id>
id day y age
<int> <num> <int> <num>
1: 1 NA NA 21
2: 2 1 1 28
3: 2 2 2 28
4: 2 3 3 28
5: 3 1 4 32
6: 3 2 5 32
7: 3 3 6 32
8: 3 4 7 32
9: 4 1 8 23
10: 4 2 9 23
=.(id), nomatch=NULL] # Keeps only ids in both b and L (right outer join) L[b, on
Key: <id>
id day y age
<int> <num> <int> <num>
1: 2 1 1 28
2: 2 2 2 28
3: 2 3 3 28
4: 3 1 4 32
5: 3 2 5 32
6: 3 3 6 32
7: 3 4 7 32
8: 4 1 8 23
9: 4 2 9 23
<- unique(c(b[, id], L[, id]))
uid =.(id)]] # Keeps ids in either b or c (full outer join) L[b[.(uid), on
id day y age
<int> <num> <int> <num>
1: 1 NA NA 21
2: 2 1 1 28
3: 2 2 2 28
4: 2 3 3 28
5: 3 1 4 32
6: 3 2 5 32
7: 3 3 6 32
8: 3 4 7 32
9: 4 1 8 23
10: 4 2 9 23
11: 5 1 10 NA
12: 5 2 11 NA
13: 5 3 12 NA
merge(b, L, by='id', all=TRUE) # also full outer join; calls merge.data.table
Key: <id>
id age day y
<num> <num> <num> <int>
1: 1 21 NA NA
2: 2 28 1 1
3: 2 28 2 2
4: 2 28 3 3
5: 3 32 1 4
6: 3 32 2 5
7: 3 32 3 6
8: 3 32 4 7
9: 4 23 1 8
10: 4 23 2 9
11: 5 NA 1 10
12: 5 NA 2 11
13: 5 NA 3 12
# Use the Hmisc Merge function which reports on how the merge went
# Merge operates on both data.tables and data.frames and can handle
# any number of them. In details printed by Merge below, #1
# refers to the first data table (b).
Merge(b, L, id= ~ id)
Vars Obs Unique IDs IDs in #1 IDs not in #1
b 2 4 4 NA NA
L 3 12 4 3 1
Merged 4 13 5 4 1
Number of unique IDs in any data frame : 5
Number of unique IDs in all data frames: 3
Key: <id>
id age day y
<num> <num> <num> <int>
1: 1 21 NA NA
2: 2 28 1 1
3: 2 28 2 2
4: 2 28 3 3
5: 3 32 1 4
6: 3 32 2 5
7: 3 32 3 6
8: 3 32 4 7
9: 4 23 1 8
10: 4 23 2 9
11: 5 NA 1 10
12: 5 NA 2 11
13: 5 NA 3 12
To help in remembering the order of data tables when joining them, think about how R does subscripting. a[b]
for vectors a
and b
results in length(b)
observations when b
is a vector of positive integers or is a character vector corresponding to names(a)
. Likewise, for two data tables a
, b
, a[b]
results in a data table with the number of rows equal to the number of rows in b
.
For very large data tables, giving the data tables keys may speed execution, e.g.:
on=
instead of relying on somewhat hidden keys
.setkey(d, id)
setkey(d, state, city)
Join/merge can be used for data lookups:
<- data.table(st=.q(AL, AK, AZ, CA, OK), y=5:1)
s <- data.table(state=state.abb, State=state.name)
stateAbbrevs =.(st=state), nomatch=NULL] s[stateAbbrevs, , on
st y State
<char> <int> <char>
1: AL 5 Alabama
2: AK 4 Alaska
3: AZ 3 Arizona
4: CA 2 California
5: OK 1 Oklahoma
12.1 Lookup Participant Disposition
In most situations it is useful to permanently hold certain subject-level variables inside the primary analysis file. In others it is more useful to have a separate dataset that defines categories to which a subject belongs, and to look up those categories by matching on ID. This is particularly useful when there are many analysis tables and we wish to defer calculation of some of the derived variables. The following example shows how to do this.
Suppose that multiple clinical trials are being run together, and some participants may be part of more than one trial. A participant disposition data table pd
will have one row per trial in which the participant is taking part. In a given row the participant’s treatment assignment is defined (variable tx
) and we also designate whether the participant belongs to the modified intent-to-treat (mitt
) sample. For analysis data table d
for trial 1 we want to look up its participants’ treatment assignments. Then we look up their assignments but make tx
equal to NA
if mitt
is FALSE
.
# Create pd data table
<- data.table(id = .q(a, a, b, c, d, e),
pd trial= .q( trial1, trial2, trial2, trial1, trial1, trial1),
tx = .q(placebo, placebo, active, placebo, active, active),
mitt = c( TRUE, TRUE, TRUE, FALSE, TRUE, FALSE),
key='id')
pd
Key: <id>
id trial tx mitt
<char> <char> <char> <lgcl>
1: a trial1 placebo TRUE
2: a trial2 placebo TRUE
3: b trial2 active TRUE
4: c trial1 placebo FALSE
5: d trial1 active TRUE
6: e trial1 active FALSE
# Create analysis file
<- data.table(id=.q(a, b, d, e, f), u=c(.1, .3, .2, 0, .4), key='id')
d d
Key: <id>
id u
<char> <num>
1: a 0.1
2: b 0.3
3: d 0.2
4: e 0.0
5: f 0.4
== 'trial1'][d] pd[trial
Key: <id>
id trial tx mitt u
<char> <char> <char> <lgcl> <num>
1: a trial1 placebo TRUE 0.1
2: b <NA> <NA> NA 0.3
3: d trial1 active TRUE 0.2
4: e trial1 active FALSE 0.0
5: f <NA> <NA> NA 0.4
== 'trial1' & mitt][d] pd[trial
Key: <id>
id trial tx mitt u
<char> <char> <char> <lgcl> <num>
1: a trial1 placebo TRUE 0.1
2: b <NA> <NA> NA 0.3
3: d trial1 active TRUE 0.2
4: e <NA> <NA> NA 0.0
5: f <NA> <NA> NA 0.4
ID | What Happened |
---|---|
a | assigned to placebo and counted as mitt |
b | in trial2 only, so pd variables NA |
d | assigned to active and counted as mitt |
e | assigned to active when mitt ignored, to NA when required mitt |
f | not in pd so pd variables NA |
12.2 Non-equi Joins: Closest Matches
Suppose that we wished to merge records without requiring an exact match, i.e., we want to get closest matches to records on a numeric variable (including dates, times, etc.). Consider the following example in which we create two data tables containing a numeric matching variable that is named differently in the two datasets. For data table a
the matching variable is named ax
and for b
it is bx
.
<- c(1, 3, 7, 12, 18, 25, 33, 42, 52)
s1 <- c(1, 4, 8, 9, 23, 27, 31, 50, 70)
s2 <- data.table(ax=s1, u=s1/100)
a a
ax u
<num> <num>
1: 1 0.01
2: 3 0.03
3: 7 0.07
4: 12 0.12
5: 18 0.18
6: 25 0.25
7: 33 0.33
8: 42 0.42
9: 52 0.52
<- data.table(bx=s2, v=s2/100)
b b
bx v
<num> <num>
1: 1 0.01
2: 4 0.04
3: 8 0.08
4: 9 0.09
5: 23 0.23
6: 27 0.27
7: 31 0.31
8: 50 0.50
9: 70 0.70
# on=.(var from first data table listed (b), var from second listed (a))
# x.bx means the variable bx from the first table listed (b)
# bx=bx will not work because in that context bx has already been
# replaced with nearest ax
<- b[a, .(ax=ax, bx=x.bx, u=u, v=v), on=.(bx=ax), roll='nearest']
m # If omit .(ax=....,v=v) result only has columns bx, v, u
m
ax bx u v
<num> <num> <num> <num>
1: 1 1 0.01 0.01
2: 3 4 0.03 0.04
3: 7 8 0.07 0.08
4: 12 9 0.12 0.09
5: 18 23 0.18 0.23
6: 25 23 0.25 0.23
7: 33 31 0.33 0.31
8: 42 50 0.42 0.50
9: 52 50 0.52 0.50
# Drop matches that even though were closest were more than 3 apart
abs(ax - bx) <= 3] m[
ax bx u v
<num> <num> <num> <num>
1: 1 1 0.01 0.01
2: 3 4 0.03 0.04
3: 7 8 0.07 0.08
4: 12 9 0.12 0.09
5: 25 23 0.25 0.23
6: 33 31 0.33 0.31
7: 52 50 0.52 0.50
Alternatively we can set v
to NA
if the match was not close enough.
abs(ax - bx) > 3, v := NA]
m[ m
ax bx u v
<num> <num> <num> <num>
1: 1 1 0.01 0.01
2: 3 4 0.03 0.04
3: 7 8 0.07 0.08
4: 12 9 0.12 0.09
5: 18 23 0.18 NA
6: 25 23 0.25 0.23
7: 33 31 0.33 0.31
8: 42 50 0.42 NA
9: 52 50 0.52 0.50
To check the speed of data.table
for nearest matching, let’s run an example like the above on two 1,000,000 row data tables that are not sorted by the matching variables. Use the qreport
function timeMar
to put run times in the right margin.
require(qreport)
set.seed(1)
<- 1000000
n <- data.table(u=sample(letters, n, TRUE), ax=runif(n))
a <- data.table(v=sample(LETTERS, n, TRUE), bx=runif(n))
b # sapply runs the data.table function uniqueN separately on each column, and stores
# the result compactly (here in an integer vector)
# Below we see a few ties in the random uniform draws
c(sapply(a, uniqueN), sapply(b, uniqueN))
u ax v bx
26 999867 26 999892
<- timeMar(b[a, .(ax=ax, bx=x.bx, u=u, v=v),
m on=.(bx=ax), roll='nearest'])
user system elapsed
1.178 0.073 0.211s
m
ax bx u v
<num> <num> <char> <char>
1: 0.6479857326 0.6479863366 y T
2: 0.0008197003 0.0008193317 d P
3: 0.8371744368 0.8371735101 g H
4: 0.2001195259 0.2001207869 a Y
5: 0.2044507302 0.2044508520 b H
---
999996: 0.7340266830 0.7340267950 z N
999997: 0.3529686446 0.3529684865 e R
999998: 0.3496006471 0.3496009253 h G
999999: 0.6106638659 0.6106637185 y K
1000000: 0.6509100210 0.6509103342 t H
sapply(m, uniqueN) # compute number of distinct values in all 4 variables
ax bx u v
999867 555325 26 26
The elapsed time was 0.211 seconds. From the above counts of the numbers of distinct values, an observation from b
tended to be the closest match to two observations in a
.
Let’s key each data table (which sorts) and check the speed again.
setkey(a, ax) # took 0.05s
setkey(b, bx) # " "
timeMar(b[a, .(ax=ax, bx=x.bx, u=u, v=v),
on=.(bx=ax), roll='nearest'])
user system elapsed
0.455 0.030 0.081s
See also Section 13.7 and here for more complex non-equi joins and considerations of overlapping intervals from each of two data tables.
12.2.1 Closest Match Within Groups
What if you wanted to find the closest match on a numeric variable x
but to do this separately by groups defined by a discrete variable grp
? Here is an example showing how.
<- data.table(
A grp = c("a", "a", "b", "b"),
val = 1:4,
x = c(2.1, 2.2, 1.9, 3)
)
<- data.table(
B x = c(2, 2.3),
z = c("foo", "bar")
)
A
grp val x
<char> <int> <num>
1: a 1 2.1
2: a 2 2.2
3: b 3 1.9
4: b 4 3.0
B
x z
<num> <char>
1: 2.0 foo
2: 2.3 bar
='x', roll='nearest'], by=grp] A[, .SD[B, on
grp val x z
<char> <int> <num> <char>
1: a 1 2.0 foo
2: a 2 2.3 bar
3: b 3 2.0 foo
4: b 3 2.3 bar