MSCI Biostatistics II Homework Assignments

Authors

Affiliations

Tom Stewart

University of Virginia
School of Data Science

Frank Harrell

Vanderbilt University School of Medicine
Department of Biostatistics

Published

February 7, 2023

Unless stated otherwise these problems correspond to The Analysis of Biological Data (ABD) practice problems. ABD datasets may be fetched using commands like that below, embedded into your analysis script.

require(Hmisc)   # make Hmisc available if not done earlier
                 # note: Hmisc is always there if require(rms) was done earlier
getRs('abd.r')   # fetch abd function
# Fetch csv file from Github, read it, and store as data frame d
d <- getabd('dataset name')   # e.g. getabd('17-q-01')

Homework 1

See ABD Chapter 17 problem 1 2nd ed. The dataset to fetch is 17-q-01.

A: Make a scatterplot, with penalty minutes as the response variable (\(y\)-axis).

B: Write the proposed simple regression model with 3 parameters.

C: Interpret each parameter of the model in the context of the research question and units of measurement.

D: Fit the model.

E: Create a scatter plot of the data overlayed with the estimated regression line, the 0.95 confidence band for the population mean, and the 0.95 confidence band for individual forecasts.

F: What do confidence intervals communicate?

Homework 2

Read the description of the data in ABD chapter 17 problem 1 2nd ed. The dataset is available from ABD Datasets as 17-q-01.

A: A hypothesis test in regression can be thought of as a comparison of two models. Write the two regression models that are implicitly being compared when testing that effect of face ratio is null.

B: Call the larger model F (for full) and the smaller model R (for reduced). Fit and plot both models.

Hints: The rms package ols function does not handle models with no predictors. To fit an intercept-only model use the built-in function lm, e.g. use lm(y ~ 1). Running anova() on this result will give you quantities you need. You can also compare a null with a non-null model doing anova(lm(y ~ 1), lm(y ~ x)).

The intercept is just the mean of y, so to plot that model you can just add a horizontal line to a regular regression plot created by ggplot(Predict(...)) using for example

f <- ols(...)
ymean <- with(d, mean(y, na.rm=TRUE))   # same as mean(d$y, na.rm=TRUE)
ggplot(Predict(f)) + geom_hline(yintercept=ymean, col='red')

C: Complete the table below from the output from part B.

Model	No. parameters	SSE	R^2
F
R

D: In order to compare F and R, find the following:

Param	Number
q	[No. params. in F] - [No. params. in R]
p	[No. params. in F]
n	[No. of obs in data frame]

E: Calculate the \(F\)-stat. The equation can be found here. Identify the \(F\)-stat in the output of the FULL regression.

F: Calculate the p-value for the \(F\)-stat. Type ?pf to see function documentation. Example R code: pf(F ratio, numerator df, denominator df, lower.tail=FALSE). Identify the \(F\)-stat p-value in the F regression output.

Homework 3

Consider the data from ABD Chapter 18 question 12 2nd edition, named 18-q-12. For variables stored as logs, create new variables that restore the original measurements by taking anti-logs.

A: Consider the model

\[E[\ln(\text{brain}) | \cdots ] = \beta_{0} + \beta_{1}\times [\text{human}] + \beta{2}\times\ln(\text{mass}) + \beta_{3}\times[\text{human}]\times\ln(\text{mass})\]

\[\text{Var}[\ln(\text{brain}) | \cdots ] = \sigma^{2}\]

Interpret each of the parameters in the model.

B: Explain or interpret, in the context of the subject matter, the hypothesis test

\(H_0\): \(\beta_{3} = 0\) \(H_\text{a}\): \(\beta_{3}\neq 0\)

(What two models are being compared? What question is the hypothesis test trying to answer?)

C: Fit the regression model and perform the hypothesis test in part B.

D: Interpret the results of the hypothesis test performed in part C. Write your interpretation in the context of the subject matter, as if you writing for the results section in a manuscript.

E: Complete the table of standard hypothesis tests. (Note that this is NOT the table printed in ABD.)

Variable	DF	F	P
Species
interaction
Mass
interaction

F: Create a plot of the regression model in part A. Emphasize plots depicting fold-change on the \(y\)-axis. Estimate a meaningful fold-change.

Hints for fold-change: When the response variable Y is logged, the anti-log of predicted differences represents fold-change in Y. When a predictor is also logged, it is of interest to estimate the fold-change effect of Y on a fold-change in that predictor. Here is a prototype where we fit a linear model with predictors x1 and x2 and their interation, where x1 and y are logged. We estimate the fold-change in y1 per a 1.25 fold-change in x1, holding x2 at the value 'A'. Make sure you properly interpret this fold-change.

f <- ols(y ~ log(x1) * x2, data=d)
summary(f, x1=c(1, 1.25), x2='A', antilog=TRUE)
# type ?summary.rms for details

Since log(x1) is linear in the model, the range for x1 given to summary can be any two numbers whose ratio is 1.25 to get a 1.25\(\times\) fold-change effect for x1. antilog=TRUE anti-logs the point estimate and the confidence limits so that we get fold-change for y.

Homework 4

Read the information for ABD Chapter 18 question 7 2nd ed, dataset 18-q-07.

A: Propose a linear regression model to analyze the study data. Write the model twice, first with reference cell parameterization and second with cell means parameterization, the latter being for extra credit.

B: Explain how the researchers might use the regression model to answer the research questions. Specifically, explain how the linear model will be used to address the possibility that female life span is reduced because the SP protein results in the costly production of more eggs. Perhaps, sketch out what the data/model would look like if the data where consistent with the possibility and what the data/model would look like otherwise.

C: Write the formal hypothesis test from part b, either in terms of betas or as a comparison of two models.

D: Fit the regression model and perform the hypothesis test from part C.

E: Interpret the results of the hypothsis test in part d as if you were writing the results for a manuscript. Address the specific question in part B.

Homework 5

Do RMS Chapter 2 problem 5. The data may be defined using this R code:

pct <- c( 4,  5,  5,  6,  6,  7,  9,  9, 10, 10, 10, 10, 12, 12, 13, 13, 
         14, 14, 14, 16, 17, 18, 20, 22, 23,
         24, 29, 37, 43, 44, 45, 49, 50, 52, 55, 57, 58, 59, 60,      
         62, 63, 63, 63, 64, 64, 68, 69,
         72, 73, 81)
score <-
c(482,498,513,498,511,479,480,483,475,476,487,494,474,478,457,
  485,451,471,473,467,470,464,471,455,452,440,460,448,441,424,
  417,422,441,408,412,400,401,430,433,433,404,424,430,
  431,437,446,424,420,432,436)

Homework 6

A: Use the NHANES data (getHdata(nhgh)) and fit a regression model that predicts height from age, race and sex. Prespecify the model, consider non-linear associations and possible interactions.

B: Generate a QQ plot of the residuals. Comment on which assumption(s) this visual diagnostic assesses. What do you conclude?

C: Generate the residual-versus-fitted plot. Comment on which assumption(s) this visual diagnostic assesses. What do you conclude?

D: Compute appropriate influence statistics. Comment on the purpose. What do you conclude?

E: Generate a single partial-effect plot of the regression with age on the \(x\)-axis and predicted height on the \(y\)-axis. Choose a single race to display. Include forecast confidence intervals.

F: Construct a linear model of gh using any of the covariates except alb, bun, and scr. Generate the QQ plot of the residuals. What do you conclude?

Homework 7

Using the NHANES data (nhgh), do the following

A: Generate a summary table that reports the amount of missing data for each variable

B: Generate a summary table that reports the patterns of missing data. What percentage of observations have all variable data.

Homework 8

Again for nhgh do the following.

A: Create a variable cluster plot of wt ht bmi leg arml armc waist tri sub age sex re

B: Generate the first principle component of the variables leg, arml, ht, report the internal coefficients of the principle component.

C: Generate the plot of generalized Spearman’s \(\rho\) for wt ht bmi leg arml armc waist tri sub age sex as predictors of gh. See this.

Homework 9

Take a break. Get caught up.

Homework 10

Using the SUPPORT data (getHdata(support)), do the following:

A: Create the potential predictive power plot and the variable cluster plot of the following variables:

sex num_co age meanbp hrt resp bili crea sod slos dzgroup hospdead when predicting ln(totcst)

B: Based on the output in part A, decide if you need to combine some variables and decide how many knots to allocate to continuous variables. Decide on a model that includes all of the variables and include a meaningful interaction of a continuous and categorical variable. (Hint: interact slos and dzgroup). Write the model in the following way (replacing ?? with the desired number of knots).

\(E[\ln \text{totcst}| \cdots] =\) sex + num_co + rcs(age, ??) +
rcs(meanbp, ??) + rcs(hrt, ??) +
… +
rcs(slos, ??) + rcs(slos, ??) * dzgroup +
dzgroup + hospdead

C: Generate residual diagnostic plots (QQ, residual by \(\hat{Y}\)) and the DFFITS influence statistics

D: Generate the in-sample calibration plot (Y by \(\hat{Y}\)). Add the line of identity or the scatter plot.

E: Generate the table of standard hypotheses with tests of no associations, tests of interaction, tests of linearity for the following variables: slos dzgroup sex

F: Calculate the mean absolute prediction error of your model.

G: Calculate a measure of rank discrimination.

H Calculate the \(R^2\) optimism.

Homework 11

Using the titanic3 data (getHdata(titanic3)), do the following

A: Calculate the effective sample size. Determine the available degrees of freedom under the 15:1 rule.

B: Fit the model reported in section 12.3 of the RMS course notes.

C: Calculate one rank discrimination measure and one discrimination measure. (See output table reported in RMS notes.)

D: Generate the calibration plot.

E: Generate the table of standard hypotheses of predictor effects.

F: Replicate the partial effect plots reported in figure 12.5 of RMS course notes.

Homework 12

Read the description of the Stress Echo dataset. Fetch the dataset using getHdata(stressEcho). Consider a logistic regression model of the composite outcome anyevent. Because there are only 89 events, researchers only want to fit a model with at most 9 parameters to lower the potential of over-fitting the data. The following variables should be included in the model:

age, as a restricted cubic spline with 3 knots
gender
dose of dobutamine
indicator that stress echocardiogram was positive
indicator that cardiologist sees wall motion anomaly on echocardiogram
history of heart attack
history of hypertension

Do the following:

A: Because the parameter budget is tight, include dose in the model with a quadratic term (using only two parameters).

B: Generate the table of standard hypotheses for this model.

C: What do you conclude from the hypothesis test of the overall impact of dose?

Homework 13

Using the NHANES (getHdata(nhgh)) data, do the following:

A: Fit an ordinal regression model of gh with sex ht and wt as predictors. Include all possible interactions. Model the association between the continuous predictors in a way that allows non-linear relationships.

B: Separately generate a partial effect plot for the 3 estimands listed below, as a function of wt ranging from 68 to 136 kg. Set sex to male and ht to 180 cm.

Mean gh
0.9 quantile of gh
\(\Pr(\text{gh} > 7)\)

Hint: see this and derive R functions that can be used as fun= in Predict to transform the \(y\)-axis from the linear predictor scale to the scale of the desired estimand. Use the rms Mean, Quantile, and ExProb functions. The latter two require more work because they require the user to specify, respectively, the quantile and the threshold for computing the desired estimates. Here is an example:

f <- orm(...)
# Derive general exceedance probability function
ex <- ExProb(f)
# Need a special case for exceeding a threshold (here, Y=3)
# Compute P(y > 3)
ex3 <- function(lp) ex(lp, y=3)
ggplot(Predict(f, x, fun=ex3)) + ylab('Pr(Y > 3)')