How Does a Compound Symmetric Correlation Structure Translate to a Markov Model?
Random intercepts in a model induce a compound symmetric correlation structure in which the correlation between two responses on the same subject have the same correlation no matter the time gap between the responses. How does one capture such an unnatural structure using a Markov semiparametric model?
Let’s generate some data with repeatedly measured outcome per subject where the outcome is a coursened continuous response from a linear model, treated as ordinal, and the random effects have a \(N(0, 0.25^2)\) distribution. Generate 5000 subjects having 10 measurements each.
Code
< 5000 # subjects
n set.seed(2)
< rnorm(n) * 0.25
re < runif(n) # baseline covariate, will be duplicated over time within subject
X < 10 # measurements per subject
m
< rep(1 : n, each = m)
id < X[id]
x < x + re[id] + rnorm(n * m) * 0.5
y < pmin(2, pmax(1, y)) # curtail y
y < round(4 * y) # coursen it
y < data.table(id, x, y)
d := 1 : .N, by=id]
d[, t # Make short and wide dataset
< dcast(d[, .(id, t, y)],
w ~ t, value.var='y')
id < cor(as.matrix(w[, 1]), use='pairwise.complete.obs',
r method='pearson')
< plotCorrM(r)
p 1]] p[[
Code
2]] + ylim(0,1) p[[
The flat correlation pattern vs. time gap is as expected.
Fit a firstorder Markov PO model to times 2, 3, …, 10. Allow previous y to interact with a flexible function of time.
Code
:= shift(y), by=id]
d[, yprev < orm(y ~ pol(x, 2) + pol(yprev, 2) * pol(t, 3), data=d)
f anova(f)
Wald Statistics for y 


χ^{2}  d.f.  P  
x  5565.78  2  <0.0001 
Nonlinear  6.28  1  0.0122 
yprev (Factor+Higher Order Factors)  1696.29  8  <0.0001 
All Interactions  3.77  6  0.7082 
Nonlinear (Factor+Higher Order Factors)  0.68  4  0.9544 
t (Factor+Higher Order Factors)  5.94  9  0.7458 
All Interactions  3.77  6  0.7082 
Nonlinear (Factor+Higher Order Factors)  2.01  6  0.9188 
yprev × t (Factor+Higher Order Factors)  3.77  6  0.7082 
Nonlinear  1.50  5  0.9126 
Nonlinear Interaction : f(A,B) vs. AB  1.50  5  0.9126 
f(A,B) vs. Af(B) + Bg(A)  0.31  2  0.8560 
Nonlinear Interaction in yprev vs. Af(B)  0.43  3  0.9346 
Nonlinear Interaction in t vs. Bg(A)  1.38  4  0.8472 
TOTAL NONLINEAR  8.41  9  0.4938 
TOTAL NONLINEAR + INTERACTION  10.65  10  0.3853 
TOTAL  11187.92  13  <0.0001 
Code
< orm(y ~ pol(x,2) + yprev + t, data=d)
g g
Logistic (Proportional Odds) Ordinal Regression Model
orm(formula = y ~ pol(x, 2) + yprev + t, data = d)
Frequencies of Responses
4 3 2 1 0 1 2 3 4 5 6 7 8 634 1012 2109 3494 5106 6413 6908 6529 5315 3620 2106 1090 664Frequencies of Missing Values Due to Each Variable
y x yprev t 0 0 5000 0
Model Likelihood Ratio Test 
Discrimination Indexes 
Rank Discrim. Indexes 


Obs 45000  LR χ^{2} 12078.42  R^{2} 0.238  ρ 0.488 
Distinct Y 13  d.f. 4  R^{2}_{4,45000} 0.235  
Y_{0.5} 2  Pr(>χ^{2}) <0.0001  R^{2}_{4,44374.5} 0.238  
max ∂log L/∂β 1×10^{6}  Score χ^{2} 12153.40  Pr(Y ≥ median)½ 0.194  
Pr(>χ^{2}) <0.0001 
β  S.E.  Wald Z  Pr(>Z)  

x  2.7847  0.1160  24.01  <0.0001 
x^{2}  0.2728  0.1113  2.45  0.0142 
yprev  0.1551  0.0038  41.14  <0.0001 
t  0.0040  0.0032  1.24  0.2136 
There was no evidence for a nonlinear effect of previous y, or an interaction between previous value and time.
FifthOrder Markov Model with Linear Effects
Code
:= shift(yprev), by=id]
d[, yprev2 := shift(yprev2), by=id]
d[, yprev3 := shift(yprev3), by=id]
d[, yprev4 := shift(yprev4), by=id]
d[, yprev5 < orm(y ~ pol(x, 2) + yprev + yprev2 + yprev3 + yprev4 + yprev5 + t, data=d)
f f
Logistic (Proportional Odds) Ordinal Regression Model
orm(formula = y ~ pol(x, 2) + yprev + yprev2 + yprev3 + yprev4 + yprev5 + t, data = d)
Frequencies of Responses
4 3 2 1 0 1 2 3 4 5 6 7 8 337 540 1183 1958 2754 3601 3855 3647 2964 2015 1167 622 357
Model Likelihood Ratio Test 
Discrimination Indexes 
Rank Discrim. Indexes 


Obs 25000  LR χ^{2} 8641.08  R^{2} 0.295  ρ 0.538 
Distinct Y 13  d.f. 8  R^{2}_{8,25000} 0.292  
Y_{0.5} 2  Pr(>χ^{2}) <0.0001  R^{2}_{8,24649.9} 0.295  
max ∂log L/∂β 2×10^{6}  Score χ^{2} 8831.52  Pr(Y ≥ median)½ 0.212  
Pr(>χ^{2}) <0.0001 
β  S.E.  Wald Z  Pr(>Z)  

x  1.6573  0.1579  10.49  <0.0001 
x^{2}  0.2336  0.1494  1.56  0.1180 
yprev  0.0949  0.0053  17.91  <0.0001 
yprev2  0.0956  0.0053  18.12  <0.0001 
yprev3  0.1062  0.0053  20.07  <0.0001 
yprev4  0.0882  0.0052  16.80  <0.0001 
yprev5  0.0771  0.0052  14.70  <0.0001 
t  0.0060  0.0079  0.76  0.4488 
Code
anova(f, yprev2, yprev3, yprev4, yprev5)
Wald Statistics for y 


χ^{2}  d.f.  P  
yprev2  328.18  1  <0.0001 
yprev3  402.61  1  <0.0001 
yprev4  282.37  1  <0.0001 
yprev5  216.13  1  <0.0001 
TOTAL  1833.15  4  <0.0001 
For the compound symmetric correlation structure the effect of all previous values of y is the same, and all lags are important.
See if the mean of all previous values can replace all the individual lagged variables.
Code
:= (yprev + yprev2 + yprev3 + yprev4 + yprev5) / 5]
d[, ypmean < orm(y ~ pol(x, 2) + ypmean + t, data=d)
fm fm
Logistic (Proportional Odds) Ordinal Regression Model
orm(formula = y ~ pol(x, 2) + ypmean + t, data = d)
Frequencies of Responses
4 3 2 1 0 1 2 3 4 5 6 7 8 337 540 1183 1958 2754 3601 3855 3647 2964 2015 1167 622 357Frequencies of Missing Values Due to Each Variable
y x ypmean t 0 0 25000 0
Model Likelihood Ratio Test 
Discrimination Indexes 
Rank Discrim. Indexes 


Obs 25000  LR χ^{2} 8626.32  R^{2} 0.295  ρ 0.538 
Distinct Y 13  d.f. 4  R^{2}_{4,25000} 0.292  
Y_{0.5} 2  Pr(>χ^{2}) <0.0001  R^{2}_{4,24649.9} 0.295  
max ∂log L/∂β 2×10^{6}  Score χ^{2} 8815.53  Pr(Y ≥ median)½ 0.211  
Pr(>χ^{2}) <0.0001 
β  S.E.  Wald Z  Pr(>Z)  

x  1.6530  0.1579  10.47  <0.0001 
x^{2}  0.2297  0.1494  1.54  0.1243 
ypmean  0.4618  0.0087  52.90  <0.0001 
t  0.0061  0.0079  0.78  0.4355 
Code
AIC(fm); AIC(f)
[1] 107771.7 [1] 107765
The previous model is better but by an inconsequential amount. The compound symmetric correlation pattern is reproduced by conditioning on the mean of all previous response values.
Combined Markov and Random Effects Model
Add random intercepts and see if lags are still important in a Bayesian PO model.
Code
require(rmsb)
options(mc.cores=4, rmsb.backend='cmdstan') # took 7m on 4 cores (4 chains)
< blrm(y ~ pol(x, 2) + yprev + yprev2 + yprev3 + yprev4 + yprev5 + t +
b cluster(id), data=d, file='corstructb.rds')
b
Bayesian Proportional Odds Ordinal Logistic Model
Dirichlet Priors With Concentration Parameter 0.187 for Intercepts
blrm(formula = y ~ pol(x, 2) + yprev + yprev2 + yprev3 + yprev4 + yprev5 + t + cluster(id), data = d, file = "corstructb.rds")
Frequencies of Responses
4 3 2 1 0 1 2 3 4 5 6 7 8 337 540 1183 1958 2754 3601 3855 3647 2964 2015 1167 622 357
Mixed Calibration/ Discrimination Indexes 
Discrimination Indexes 
Rank Discrim. Indexes 


Obs 25000  B 0.194 [0.194, 0.194]  g 1.301 [1.273, 1.327]  C 0.704 [0.704, 0.704] 
Draws 4000  g_{p} 0.263 [0.259, 0.267]  D_{xy} 0.408 [0.408, 0.409]  
Chains 4  EV 0.217 [0.21, 0.223]  
Time 369.7s  v 1.29 [1.237, 1.342]  
p 8  vp 0.053 [0.051, 0.054]  
Cluster on id 

Clusters 5000  
σ_{γ} 0.0328 [1e04, 0.0966] 
Mean β  Median β  S.E.  Lower  Upper  Pr(β>0)  Symmetry  

x  1.6592  1.6587  0.1567  1.3281  1.9378  1.0000  0.98 
x^{2}  0.2314  0.2334  0.1482  0.5081  0.0684  0.0583  1.03 
yprev  0.0946  0.0946  0.0052  0.0847  0.1052  1.0000  0.99 
yprev2  0.0954  0.0954  0.0051  0.0855  0.1048  1.0000  0.95 
yprev3  0.1061  0.1061  0.0054  0.0955  0.1166  1.0000  1.03 
yprev4  0.0883  0.0882  0.0049  0.0784  0.0977  1.0000  0.99 
yprev5  0.0773  0.0773  0.0052  0.0674  0.0875  1.0000  1.01 
t  0.0061  0.0061  0.0083  0.0093  0.0228  0.7562  1.03 
With the lagged variables in the model the standard deviation \(\sigma_\gamma\) of the random effects is very small, and oddly enough the previous values remain important. I conjecture that if the lagged variables were removed \(\sigma_\gamma\) would be much larger and the model would be about as good.
Computing Environment
`R version 4.3.2 (20231031) Platform: aarch64appledarwin20 (64bit) Running under: macOS Ventura 13.6.1 Matrix products: default BLAS: /Library/Frameworks/R.framework/Versions/4.3arm64/Resources/lib/libRblas.0.dylib LAPACK: /Library/Frameworks/R.framework/Versions/4.3arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] rmsb_1.00 ggplot2_3.4.4 data.table_1.14.8 rms_6.80 [5] Hmisc_5.12To cite R in publications use:
R Core Team (2023). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.Rproject.org/.
To cite the Hmisc package in publications use:
Harrell Jr F (2023). Hmisc: Harrell Miscellaneous. R package version 5.12, https://hbiostat.org/R/Hmisc/.
To cite the rms package in publications use:
Harrell Jr FE (2023). rms: Regression Modeling Strategies. R package version 6.80, https://github.com/harrelfe/rms, https://hbiostat.org/R/rms/.
To cite the rmsb package in publications use:
Harrell F (2023). rmsb: Bayesian Regression Modeling Strategies. R package version 1.00, https://hbiostat.org/R/rmsb/.
To cite the data.table package in publications use:
Dowle M, Srinivasan A (2023). data.table: Extension of ‘data.frame’. R package version 1.14.8, https://CRAN.Rproject.org/package=data.table.
To cite the ggplot2 package in publications use:
Wickham H (2016). ggplot2: Elegant Graphics for Data Analysis. SpringerVerlag New York. ISBN 9783319242774, https://ggplot2.tidyverse.org.
`{=html}