18 Simulation

flowchart LR
  More[Simulate More Often!]
  Easy[Minimize Coding] --> Tr[data.table/data.frame and array<br>Tricks To Systematize Simulation of<br>Multiple Conditions]
Some of the best ways to validate an analysis are
- If using any model/feature selection methods, use the bootstrap to check whether the selection process is volatile, e.g., because your sample size isn't large enough to support making hard-and-fast selections of predictors/features
- Use Monte Carlo simulation to check if the correct model or correct predictors are usually selected
- Simulate a large dataset under a known model and known parameter values and make sure the estimation process you use can recover the true parameter values (a sketch follows this list)
- Simulate the statistical performance of a method under a variety of conditions
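As a concrete illustration of the parameter-recovery item, here is a minimal sketch; the sample size, true coefficients, and error standard deviation are arbitrary choices made only for this example.

# Minimal sketch of checking parameter recovery; n, the true coefficients,
# and the error SD are arbitrary illustrative values
set.seed(1)
n   <- 50000
x1  <- rnorm(n)
x2  <- rnorm(n)
y   <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n, sd=1.5)   # true coefficients 1, 2, -0.5
fit <- lm(y ~ x1 + x2)
coef(fit)    # should be close to 1, 2, -0.5
sigma(fit)   # residual SD; should be close to 1.5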
Unlike papers published in traditional journals, which due to space limitations cannot study a huge variety of situations, simulation can study the performance of a method under conditions that mimic yours.
When simulating the performance of various quantities under various conditions, creating a large number of variables makes the code long and tedious. It is better to use data frames/tables or arrays to hold everything together. Data frames and arrays also lead to efficient graphics code for summarization.
18.1 Data Table Approach
The expand.grid
function is useful for generating all combinations of simulation conditions. Suppose we wanted to simulate statistical properties of the maximum absolute value of the sample correlation coefficient from a matrix of all pairwise correlations from truly uncorrelated variables. We do this while varying the sample size n
, the number of variables p
, and the type
of correlation (Pearson’s or Spearman’s, denoted by r
and rho
). With expand.grid
we don’t need a lot of nested for
loops. Run 500 simulations for each condition.
require(Hmisc)
require(qreport)
require(data.table)
require(ggplot2)    # used for plotting simulation summaries below
hookaddcap()   # make knitr call a function at the end of each chunk
               # to try to automatically add to list of figures
nsim <- 500
R <- expand.grid(n=c(10, 20, 50, 100),
                 p=c(2, 5, 10, 20),
                 sim=1 : nsim)
setDT(R)
set.seed(1)
for(i in 1 : nrow(R)) {   # took 4s
  w    <- R[i, ]
  n    <- w$n
  p    <- w$p
  X    <- matrix(rnorm(n * p), ncol=p)
  cors <- cor(X)
  maxr <- max(abs(cors[row(cors) < col(cors)]))   # use upper triangle
  # Better: max(abs(cors[upper.tri(cors)]))
  cors   <- cor(X, method='spearman')
  maxrho <- max(abs(cors[row(cors) < col(cors)]))
  set(R, i, 'maxr',   maxr)     # set is in data.table & is very fast
  set(R, i, 'maxrho', maxrho)   # set will create the variable if needed
  # If not using data.table use this slower approach:
  # R[i, 'maxr'] <- maxr etc.
}
The simulations could have been cached or parallelized as discussed above.
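As one illustration of parallelizing (a sketch only, not necessarily the approach referred to above), the row-by-row loop could be recast as a function applied over rows with parallel::mclapply; the choice of 4 cores is arbitrary, and mclapply forks, so it runs serially on Windows.

# Sketch of a parallel version of the loop above using parallel::mclapply
# (forking; runs serially on Windows); mc.cores=4 is an arbitrary choice
require(parallel)
RNGkind("L'Ecuyer-CMRG")   # RNG suitable for parallel streams
set.seed(1)
sim1 <- function(i) {
  w   <- R[i, ]
  X   <- matrix(rnorm(w$n * w$p), ncol=w$p)
  r   <- cor(X)
  rho <- cor(X, method='spearman')
  data.table(maxr   = max(abs(r[upper.tri(r)])),
             maxrho = max(abs(rho[upper.tri(rho)])))
}
res <- rbindlist(mclapply(1 : nrow(R), sim1, mc.cores=4))
R[, c('maxr', 'maxrho') := res]   # merge results back into the conditions table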
Compute the mean (over simulations) maximum correlation (over variables) and plot the results.
w <- R[, .(maxr = mean(maxr), maxrho = mean(maxrho)), by=.(n, p)]
# Make data table taller and thinner to put r, rho as different observations
u <- melt(w, id.vars=c('n', 'p'), variable.name='type', value.name='r')
u[, type := substring(type, 4)]   # remove "max"
ps <- c(2, 5, 10, 20)
u[, p := factor(p, ps, paste0('p:', ps))]
g <- ggplot(u, aes(x=n, y=r, col=type)) + geom_jitter(height=0, width=2) +
  ylim(0, 1) +
  facet_wrap(~ p) +
  guides(color=guide_legend(title='')) +
  ylab('Mean Maximum Correlation Coefficient')
plotly::ggplotly(g)
18.1.1 expand.grid with lapply and rbindlist
If you want to run the simulation combinations with expand.grid
, it is sometimes convenient to have the computations produce a data.table
or list from each row of settings in the expand.grid
output. By using a trick with lapply
we can create a list of data tables, each one computed using parameter values from one row. The data.table
rbindlist
function is extremely fast in combining these results into a single data table.
f <- function(alpha, beta) {
  ...
  # y is a vector, alpha and beta are scalars (auto-expanded to length of vector)
  y <- some.function.of.alpha.beta
  data.table(alpha, beta, y)
}
w <- expand.grid(alpha=c(0.01, 0.025, 0.05), beta=c(0.05, 0.1, 0.2))
setDT(w)   # make it a data table
# w[i, f(...)] runs f() on values of alpha, beta in row i of w
# z is a list of data.tables
z <- lapply(1 : nrow(w), function(i) w[i, f(alpha, beta)])
u <- rbindlist(z)   # stack the list's results
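To make the pattern concrete, here is one hypothetical way f could be filled in; the use of a one-sample t-test, the re-use of beta as an effect size, and the 100 simulations per condition with n=25 are inventions for illustration only.

# Hypothetical filled-in version of the pattern above (illustration only):
# for each (alpha, beta) combination simulate p-values for a one-sample t-test
# with true mean beta (beta re-used here as an effect size), then estimate power
f <- function(alpha, beta) {
  y <- replicate(100, t.test(rnorm(25, mean=beta))$p.value)   # y is a vector
  data.table(alpha, beta, reject = y < alpha)   # scalars auto-expanded
}
w <- expand.grid(alpha=c(0.01, 0.025, 0.05), beta=c(0.05, 0.1, 0.2))
setDT(w)
set.seed(2)
z <- lapply(1 : nrow(w), function(i) w[i, f(alpha, beta)])
u <- rbindlist(z)
u[, .(power = mean(reject)), by=.(alpha, beta)]   # one summary row per condition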
18.2 Array Approach
For large problems, storing results in R arrays is more efficient and doesn't require duplication of values of n
and p
over simulations. Once the array
is created it can be converted into a data table for graphing.
nsim <- 500
ns   <- c(10, 20, 50, 100)
ps   <- c(2, 5, 10, 20)
R    <- array(NA, dim=c(nsim, length(ns), length(ps), 2),
              dimnames=list(NULL,
                            n    = as.character(ns),
                            p    = as.character(ps),
                            type = c('r', 'rho')))
dim(R)
[1] 500 4 4 2
dimnames(R)
[[1]]
NULL
$n
[1] "10" "20" "50" "100"
$p
[1] "2" "5" "10" "20"
$type
[1] "r" "rho"
set.seed(1)
Note the elegance below in how current simulation results are inserted into the simulation results object R, making use of dimension names as subscripts, except for subscript i for the simulation number, which is an ordinary sequential integer subscript. Were the simulated values vectors instead of scalars (maxr below), we would have used a statement such as R[i, as.character(n), as.character(p), type, ] <- calculated.vector.
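For instance, a hypothetical sketch for storing the three largest absolute correlations per condition (the extra rank dimension and its labels are invented here purely for illustration) would add one dimension to the array.

# Hypothetical sketch: store a length-3 vector per condition by adding a
# 'rank' dimension (this dimension and its labels are invented for illustration)
R3 <- array(NA, dim=c(nsim, length(ns), length(ps), 2, 3),
            dimnames=list(NULL,
                          n    = as.character(ns),
                          p    = as.character(ps),
                          type = c('r', 'rho'),
                          rank = paste0('largest', 1 : 3)))
# Inside the simulation loops one would then use, e.g.:
# R3[i, as.character(n), as.character(p), type, ] <- calculated.vector   # length 3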
for(i in 1 : nsim) {   # took 1s
  for(n in ns) {
    for(p in ps) {
      X <- matrix(rnorm(n * p), ncol=p)
      for(type in c('r', 'rho')) {
        cors <- cor(X,
                    method=switch(type, r = 'pearson', rho = 'spearman'))
        maxr <- max(abs(cors[row(cors) < col(cors)]))
        R[i, as.character(n), as.character(p), type] <- maxr
      }
    }
  }
}
There are many other ways to specify cor(X, method=...). Here are several other codings for method that will yield equivalent results.
fcase(type == 'r', 'pearson', type == 'rho', 'spearman')
fcase(type == 'r', 'pearson', default='spearman')
c(r='pearson', rho='spearman')[type]
.q(r=pearson, rho=spearman)[type]
if(type == 'r') 'pearson' else 'spearman'
ifelse(type == 'r', 'pearson', 'spearman')
# Compute mean (over simulations) maximum correlation for each condition
m <- apply(R, 2:4, mean)   # preserve dimensions 2,3,4; summarize over 1
# Convert the 3-dimensional array to a tall and thin data table
# Generalizations of row() and col() used for 2-dimensional matrices
# come in handy: slice.index
dn <- dimnames(m)
u  <- data.table(r    = as.vector(m),
                 n    = as.numeric(dn[[1]])[as.vector(slice.index(m, 1))],
                 p    = as.numeric(dn[[2]])[as.vector(slice.index(m, 2))],
                 type = dn[[3]][as.vector(slice.index(m, 3))])
# If doing this a lot you may want to write a dimension expander function
slice <- function(a, i) {
  dn <- all.is.numeric(dimnames(a)[[i]], 'vector')   # all.is.numeric is in Hmisc
  dn[as.vector(slice.index(a, i))]
}
u <- data.table(r    = as.vector(m),
                n    = slice(m, 1),
                p    = slice(m, 2),
                type = slice(m, 3))
# Plot u using same ggplot code as above
The result is the same as in Figure 18.1.