require(Hmisc) # make Hmisc available if not done earlier
# note: Hmisc is always there if require(rms) was done earlier
getRs('abd.r') # fetch abd function
# Fetch csv file from Github, read it, and store as data frame d
<- getabd('dataset name') # e.g. getabd('17-q-01') d
MSCI Biostatistics II Homework Assignments
Unless stated otherwise these problems correspond to The Analysis of Biological Data (ABD) practice problems. ABD datasets may be fetched using commands like that below, embedded into your analysis script.
Homework 1
See ABD Chapter 17 problem 1 2nd ed. The dataset to fetch is 17-q-01
.
A: Make a scatterplot, with penalty minutes as the response variable (\(y\)-axis).
B: Write the proposed simple regression model with 3 parameters.
C: Interpret each parameter of the model in the context of the research question and units of measurement.
D: Fit the model.
E: Create a scatter plot of the data overlayed with the estimated regression line, the 0.95 confidence band for the population mean, and the 0.95 confidence band for individual forecasts.
F: What do confidence intervals communicate?
Homework 2
Read the description of the data in ABD chapter 17 problem 1 2nd ed. The dataset is available from ABD Datasets as 17-q-01
.
A: A hypothesis test in regression can be thought of as a comparison of two models. Write the two regression models that are implicitly being compared when testing that effect of face ratio is null.
B: Call the larger model F (for full) and the smaller model R (for reduced). Fit and plot both models.
Hints: The rms
package ols
function does not handle models with no predictors. To fit an intercept-only model use the built-in function lm
, e.g. use lm(y ~ 1)
. Running anova()
on this result will give you quantities you need. You can also compare a null with a non-null model doing anova(lm(y ~ 1), lm(y ~ x))
.
The intercept is just the mean of y
, so to plot that model you can just add a horizontal line to a regular regression plot created by ggplot(Predict(...))
using for example
<- ols(...)
f <- with(d, mean(y, na.rm=TRUE)) # same as mean(d$y, na.rm=TRUE)
ymean ggplot(Predict(f)) + geom_hline(yintercept=ymean, col='red')
C: Complete the table below from the output from part B.
Model | No. parameters | SSE | R^2 |
---|---|---|---|
F | |||
R |
D: In order to compare F and R, find the following:
Param | Number |
---|---|
q | [No. params. in F] - [No. params. in R] |
p | [No. params. in F] |
n | [No. of obs in data frame] |
E: Calculate the \(F\)-stat. The equation can be found here. Identify the \(F\)-stat in the output of the FULL regression.
F: Calculate the p-value for the \(F\)-stat. Type ?pf
to see function documentation. Example R code: pf(F ratio, numerator df, denominator df, lower.tail=FALSE)
. Identify the \(F\)-stat p-value in the F regression output.
Homework 3
Consider the data from ABD Chapter 18 question 12 2nd edition, named 18-q-12
. For variables stored as logs, create new variables that restore the original measurements by taking anti-logs.
A: Consider the model
\[E[\ln(\text{brain}) | \cdots ] = \beta_{0} + \beta_{1}\times [\text{human}] + \beta{2}\times\ln(\text{mass}) + \beta_{3}\times[\text{human}]\times\ln(\text{mass})\]
\[\text{Var}[\ln(\text{brain}) | \cdots ] = \sigma^{2}\]
Interpret each of the parameters in the model.
B: Explain or interpret, in the context of the subject matter, the hypothesis test
\(H_0\): \(\beta_{3} = 0\) \(H_\text{a}\): \(\beta_{3}\neq 0\)
(What two models are being compared? What question is the hypothesis test trying to answer?)
C: Fit the regression model and perform the hypothesis test in part B.
D: Interpret the results of the hypothesis test performed in part C. Write your interpretation in the context of the subject matter, as if you writing for the results section in a manuscript.
E: Complete the table of standard hypothesis tests. (Note that this is NOT the table printed in ABD.)
Variable | DF | F | P |
---|---|---|---|
Species | |||
interaction | |||
Mass | |||
interaction |
F: Create a plot of the regression model in part A. Emphasize plots depicting fold-change on the \(y\)-axis. Estimate a meaningful fold-change.
Hints for fold-change: When the response variable Y is logged, the anti-log of predicted differences represents fold-change in Y. When a predictor is also logged, it is of interest to estimate the fold-change effect of Y on a fold-change in that predictor. Here is a prototype where we fit a linear model with predictors x1
and x2
and their interation, where x1
and y
are logged. We estimate the fold-change in y1
per a 1.25 fold-change in x1
, holding x2
at the value 'A'
. Make sure you properly interpret this fold-change.
<- ols(y ~ log(x1) * x2, data=d)
f summary(f, x1=c(1, 1.25), x2='A', antilog=TRUE)
# type ?summary.rms for details
Since log(x1)
is linear in the model, the range for x1
given to summary
can be any two numbers whose ratio is 1.25 to get a 1.25\(\times\) fold-change effect for x1
. antilog=TRUE
anti-logs the point estimate and the confidence limits so that we get fold-change for y
.
Homework 4
Read the information for ABD Chapter 18 question 7 2nd ed, dataset 18-q-07
.
A: Propose a linear regression model to analyze the study data. Write the model twice, first with reference cell parameterization and second with cell means parameterization, the latter being for extra credit.
B: Explain how the researchers might use the regression model to answer the research questions. Specifically, explain how the linear model will be used to address the possibility that female life span is reduced because the SP protein results in the costly production of more eggs. Perhaps, sketch out what the data/model would look like if the data where consistent with the possibility and what the data/model would look like otherwise.
C: Write the formal hypothesis test from part b, either in terms of betas or as a comparison of two models.
D: Fit the regression model and perform the hypothesis test from part C.
E: Interpret the results of the hypothsis test in part d as if you were writing the results for a manuscript. Address the specific question in part B.
Homework 5
Do RMS Chapter 2 problem 5. The data may be defined using this R code:
<- c( 4, 5, 5, 6, 6, 7, 9, 9, 10, 10, 10, 10, 12, 12, 13, 13,
pct 14, 14, 14, 16, 17, 18, 20, 22, 23,
24, 29, 37, 43, 44, 45, 49, 50, 52, 55, 57, 58, 59, 60,
62, 63, 63, 63, 64, 64, 68, 69,
72, 73, 81)
<-
score c(482,498,513,498,511,479,480,483,475,476,487,494,474,478,457,
485,451,471,473,467,470,464,471,455,452,440,460,448,441,424,
417,422,441,408,412,400,401,430,433,433,404,424,430,
431,437,446,424,420,432,436)
Homework 6
A: Use the NHANES data (getHdata(nhgh)
) and fit a regression model that predicts height from age, race and sex. Prespecify the model, consider non-linear associations and possible interactions.
B: Generate a QQ plot of the residuals. Comment on which assumption(s) this visual diagnostic assesses. What do you conclude?
C: Generate the residual-versus-fitted plot. Comment on which assumption(s) this visual diagnostic assesses. What do you conclude?
D: Compute appropriate influence statistics. Comment on the purpose. What do you conclude?
E: Generate a single partial-effect plot of the regression with age on the \(x\)-axis and predicted height on the \(y\)-axis. Choose a single race to display. Include forecast confidence intervals.
F: Construct a linear model of gh
using any of the covariates except alb
, bun
, and scr
. Generate the QQ plot of the residuals. What do you conclude?
Homework 7
Using the NHANES data (nhgh
), do the following
A: Generate a summary table that reports the amount of missing data for each variable
B: Generate a summary table that reports the patterns of missing data. What percentage of observations have all variable data.
Homework 8
Again for nhgh
do the following.
A: Create a variable cluster plot of wt ht bmi leg arml armc waist tri sub age sex re
B: Generate the first principle component of the variables leg, arml, ht
, report the internal coefficients of the principle component.
C: Generate the plot of generalized Spearman’s \(\rho\) for wt ht bmi leg arml armc waist tri sub age sex
as predictors of gh
. See this.
Homework 9
Take a break. Get caught up.
Homework 10
Using the SUPPORT
data (getHdata(support)
), do the following:
A: Create the potential predictive power plot and the variable cluster plot of the following variables:
sex num_co age meanbp hrt resp bili crea sod slos dzgroup hospdead
when predicting ln(totcst)
B: Based on the output in part A, decide if you need to combine some variables and decide how many knots to allocate to continuous variables. Decide on a model that includes all of the variables and include a meaningful interaction of a continuous and categorical variable. (Hint: interact slos
and dzgroup
). Write the model in the following way (replacing ?? with the desired number of knots).
\(E[\ln \text{totcst}| \cdots] =\) sex + num_co + rcs(age, ??) +
rcs(meanbp, ??) + rcs(hrt, ??) +
… +
rcs(slos, ??) + rcs(slos, ??) * dzgroup +
dzgroup + hospdead
C: Generate residual diagnostic plots (QQ, residual by \(\hat{Y}\)) and the DFFITS influence statistics
D: Generate the in-sample calibration plot (Y by \(\hat{Y}\)). Add the line of identity or the scatter plot.
E: Generate the table of standard hypotheses with tests of no associations, tests of interaction, tests of linearity for the following variables: slos dzgroup sex
F: Calculate the mean absolute prediction error of your model.
G: Calculate a measure of rank discrimination.
H Calculate the \(R^2\) optimism.
Homework 11
Using the titanic3
data (getHdata(titanic3)
), do the following
A: Calculate the effective sample size. Determine the available degrees of freedom under the 15:1 rule.
B: Fit the model reported in section 12.3 of the RMS course notes.
C: Calculate one rank discrimination measure and one discrimination measure. (See output table reported in RMS notes.)
D: Generate the calibration plot.
E: Generate the table of standard hypotheses of predictor effects.
F: Replicate the partial effect plots reported in figure 12.5 of RMS course notes.
Homework 12
Read the description of the Stress Echo dataset. Fetch the dataset using getHdata(stressEcho)
. Consider a logistic regression model of the composite outcome anyevent
. Because there are only 89 events, researchers only want to fit a model with at most 9 parameters to lower the potential of over-fitting the data. The following variables should be included in the model:
- age, as a restricted cubic spline with 3 knots
- gender
- dose of dobutamine
- indicator that stress echocardiogram was positive
- indicator that cardiologist sees wall motion anomaly on echocardiogram
- history of heart attack
- history of hypertension
Do the following:
A: Because the parameter budget is tight, include dose in the model with a quadratic term (using only two parameters).
B: Generate the table of standard hypotheses for this model.
C: What do you conclude from the hypothesis test of the overall impact of dose?
Homework 13
Using the NHANES (getHdata(nhgh)
) data, do the following:
A: Fit an ordinal regression model of gh
with sex ht
and wt
as predictors. Include all possible interactions. Model the association between the continuous predictors in a way that allows non-linear relationships.
B: Separately generate a partial effect plot for the 3 estimands listed below, as a function of wt
ranging from 68 to 136 kg. Set sex
to male and ht
to 180 cm.
- Mean
gh
- 0.9 quantile of
gh
- \(\Pr(\text{gh} > 7)\)
Hint: see this and derive R functions that can be used as fun=
in Predict
to transform the \(y\)-axis from the linear predictor scale to the scale of the desired estimand. Use the rms
Mean
, Quantile
, and ExProb
functions. The latter two require more work because they require the user to specify, respectively, the quantile and the threshold for computing the desired estimates. Here is an example:
<- orm(...)
f # Derive general exceedance probability function
<- ExProb(f)
ex # Need a special case for exceeding a threshold (here, Y=3)
# Compute P(y > 3)
<- function(lp) ex(lp, y=3)
ex3 ggplot(Predict(f, x, fun=ex3)) + ylab('Pr(Y > 3)')