flowchart LR
More[Simulate More Often!]
Easy[Minimize Coding] --> Tr[data.table/data.frame and array<br>Tricks To Systematize Simulation of<br>Multiple Conditions]


Some of the best ways to validate an analysis are

• If using any model/feature selection methods use the bootstrap to check whether the selection process is volatile, e.g., your sample size isn’t large enough too support making hard-and-fast selections of predictors/features
• Use Monte Carlo simulation to check if the correct model or correct predictors are usually selected
• Simulate a large dataset under a known model and known parameter values and make sure the estimation process you use can recover the true parameter values
• Simulate the statistical performance of a method under a variety of conditions

Unlike papers published in traditional journals which due to space limitations cannot study a huge variety of situations, simulation can study the performance of a method under conditions that mimic yours.

When simulating performance of various quantities under various conditions, creating a large number of variables makes the code long and tedious. It is better to to use data frames/tables or arrays to hold everything together. Data frames and arrays also lead to efficient graphics code for summarization.

## 18.1 Data Table Approach

The expand.grid function is useful for generating all combinations of simulation conditions. Suppose we wanted to simulate statistical properties of the maximum absolute value of the sample correlation coefficient from a matrix of all pairwise correlations from truly uncorrelated variables. We do this while varying the sample size n, the number of variables p, and the type of correlation (Pearson’s or Spearman’s, denoted by r and rho). With expand.grid we don’t need a lot of nested for loops. Run 500 simulations for each condition.

require(Hmisc)
getRs('reptools.r')
require(data.table)
hookaddcap()   # make knitr call a function at the end of each chunk
# to try to automatically add to list of figure

nsim <- 500
R <- expand.grid(n=c(10, 20, 50, 100),
p=c(2, 5, 10, 20),
sim=1 : nsim)
setDT(R)
set.seed(1)
for(i in 1 : nrow(R)) {  # took 4s
w <- R[i, ]
n <- w$n p <- w$p
X <- matrix(rnorm(n * p), ncol=p)
cors    <- cor(X)
maxr    <- max(abs(cors[row(cors) < col(cors)])) # use upper triangle
# Better: max(abs(cors[upper.tri(cors)]))
cors    <- cor(X, method='spearman')
maxrho  <- max(abs(cors[row(cors) < col(cors)]))
set(R, i, 'maxr',   maxr)    # set is in data.table & is very fast
set(R, i, 'maxrho', maxrho)  # set will create the variable if needed
# If not using data.table use this slower approach:
# R[i, 'maxr'] <- maxr   etc.
}

The simulations could have been cached or parallelized as discussed above.

Compute the mean (over simulations) maximum correlation (over variables) and plot the results.

w <- R[, .(maxr = mean(maxr), maxrho=mean(maxrho)), by=.(n, p)]
# Make data table taller and thinner to put r, rho as different observations
u <- melt(w, id.vars=c('n', 'p'), variable.name='type', value.name='r')
u[, type := substring(type, 4)]   # remove "max"
ps <- c(2, 5, 10, 20)
u[, p := factor(p, ps, paste0('p:', ps))]
g <- ggplot(u, aes(x=n, y=r, col=type)) + geom_jitter(height=0, width=2) +
ylim(0, 1) +
facet_wrap(~ p) +
guides(color=guide_legend(title='')) +
ylab('Mean Maximum Correlation Coefficient')
plotly::ggplotly(g)

## 18.2 Array Approach

For large problems, storing results in R arrays is more efficient and doesn’t require duplication of values of n and p over simulations. Once the array is created it can be converted into a data table for graphing.

nsim <- 500
ns   <- c(10, 20, 50, 100)
ps   <- c(2, 5, 10, 20)
R <- array(NA, dim=c(nsim, length(ns), length(ps), 2),
dimnames=list(NULL,
n    = as.character(ns),
p    = as.character(ps),
type = c('r', 'rho')))
dim(R)
[1] 500   4   4   2
dimnames(R)
[[1]]
NULL

$n [1] "10" "20" "50" "100"$p
[1] "2"  "5"  "10" "20"

\$type
[1] "r"   "rho"
set.seed(1)

Note the elegance below in how current simulation results are inserted into the simulation results object R, making use of dimension names as subscripts, except for subscript i for the simulation number which is a ordinary sequential integer subscript. Were the simulated values vectors instead of a scalar (maxr below), we would have used a statement such as R[i, as.character(n), as.character(p), type, ] <- calculated.vector.

for(i in 1 : nsim) {   # took 1s
for(n in ns) {
for(p in ps) {
X <- matrix(rnorm(n * p), ncol=p)
for(type in c('r', 'rho')) {
cors <- cor(X,
method=switch(type, r = 'pearson', rho = 'spearman'))
maxr <- max(abs(cors[row(cors) < col(cors)]))
R[i, as.character(n), as.character(p), type] <- maxr
}
}
}
}

There are many other ways to specify cor(X, method=...). Here are several other codings for method that will yield equivalent result.

fcase(type == 'r', 'pearson', type == 'rho', 'spearman')
fcase(type == 'r', 'pearson', default='spearman')
c(r='pearson', rho='spearman')[type]
.q(r=pearson, rho=spearman)[type]
if(type == 'r') 'pearson' else 'spearman'
ifelse(type == 'r', 'pearson', 'spearman')
# Compute mean (over simulations) maximum correlation for each condition
m <- apply(R, 2:4, mean)   # preserve dimensions 2,3,4 summarize over 1
# Convert the 3-dimensional array to a tall and thin data table
# Generalizations of row() and col() used for 2-dimensional matrices
# comes in handy: slice.index
dn <- dimnames(m)
u <- data.table(r    = as.vector(m),
n    = as.numeric(dn[[1]])[as.vector(slice.index(m, 1))],
p    = as.numeric(dn[[2]])[as.vector(slice.index(m, 2))],
type = dn[[3]][as.vector(slice.index(m, 3))])
# If doing this a lot you may want to write a dimension expander function
slice <- function(a, i) {
dn <- all.is.numeric(dimnames(a)[[i]], 'vector')   # all.is.numeric in Hmisc
dn[as.vector(slice.index(a, i))]
}
u <- data.table(r    = as.vector(m),
n    = slice(m, 1),
p    = slice(m, 2),
type = slice(m, 3))

# Plot u using same ggplot code as above

The result is the same as in Figure 18.1.