Regression Modeling Strategies Study Questions

Regression Modeling Strategies, 2nd Edition

Review

  1. Consider inference from comparing central tendencies of two groups on a continuous response variable Y. What assumptions are you willing to make when selecting a statistical test? Why are you willing to make those assumptions?
  2. Consider the comparison of 5 groups on a continuous Y. Suppose you observe that two of the groups have a similar mean and the other three also have a similar sample mean. What is wrong with combining the two samples and combining the three samples, then comparing two means? How does this compare to stepwise variable selection?
  3. Name a specific statistical test for which we don’t have a corresponding statistical model.
  4. Concerning a multi-group problem or a sequential testing problem, what is the frequentist approach to multiplicity correction? The Bayesian approach?

Chapter 1

  1. Can you estimate the effect of increasing age from 21 to 30 without a statistical model?
  2. What is an example where machine learning users have used “classification” in the wrong sense?
  3. When is classification (in the proper sense) an appropriate goal?
  4. Why are so many decisions non-binary?
  5. How do we normally choose statistical models—from subject matter theory or empirically?
  6. What is model uncertainty?
  7. An investigator feels that there are too many variables to analyze so she uses significance testing to select which variables to analyze further. What is wrong with that?

Chapter 2

  1. Contrast statistical models vs. machine learning in a few ways.

Section 2.1-2.2

  1. What is always linear in the linear model?

Section 2.3

  1. Define an interaction effect in words.
  2. In a model with main effects and two-way interactions, which regression parameter is the most “universal” by virtue of being most independent of coding?
  3. When a predictor is involved in an interaction, which test of association involving that predictor is preferred?
  4. What are two ways of doing chunk (composite) tests?
  5. An analyst intends to use least squares multiple regression to predict follow-up blood pressure from baseline blood pressure, age, and sex. She wants to use this model for estimation, prediction, and inference (statistical tests and confidence limits). List all of the assumptions she makes by fitting the model \(Y = \beta_{0} + \beta_{1} bp + \beta_{2} age + \beta_{3} (sex=female)\). (A fitting sketch follows this list.)
  6. List as many methods as you can for checking the assumptions you listed.
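
For concreteness, here is a minimal sketch of how the model in question 5 might be fitted and informally checked with Python's statsmodels. The data are simulated and the variable names (sbp_baseline, sbp_followup) are hypothetical; this is an illustration under those assumptions, not code from the text.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

# Simulated data so the sketch runs; variable names are hypothetical.
rng = np.random.default_rng(1)
n = 200
df = pd.DataFrame({
    "sbp_baseline": rng.normal(130, 15, n),
    "age": rng.uniform(30, 80, n),
    "sex": rng.choice(["male", "female"], n),
})
df["sbp_followup"] = (20 + 0.8 * df["sbp_baseline"] + 0.2 * df["age"]
                      + 3 * (df["sex"] == "female") + rng.normal(0, 10, n))

# Y = b0 + b1*bp + b2*age + b3*(sex == female): additive and linear in bp and age.
fit = smf.ols("sbp_followup ~ sbp_baseline + age + C(sex)", data=df).fit()
print(fit.summary())

# Two informal assumption checks (question 6): residuals vs. fitted values
# (constant variance, no unmodeled curvature) and a normal Q-Q plot of residuals.
plt.scatter(fit.fittedvalues, fit.resid, s=10)
plt.xlabel("Fitted values"); plt.ylabel("Residuals")
sm.qqplot(fit.resid, line="45", fit=True)
plt.show()
```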

Section 2.4.1

  1. Provide an example where categorizing a continuous predictor is a good idea.
  2. If you dichotomize a continuous predictor that has a smooth relationship with Y, how can you arbitrarily change its estimated regression coefficients?
  3. What is a general term used for the statistical problem caused by categorizing continuous variables?
  4. What tends to happen when data are used to find cutpoints for relationships that are in fact smooth?
  5. What other inferential distortion is caused by categorization?
  6. Is there an amount of noise that could be added to a continuous X that makes a categorized analysis better than a continuous one?

Section 2.4.2

  1. What is the biggest drawback to using regular polynomials to fit smooth nonlinear relationships?
  2. Why does a spline function require one to use a truncated term?
  3. What advantages does a linearly tail-restricted cubic spline have?
  4. Why are its constructed variables X’, X’’, X’’’, etc. so complex? (The standard form is sketched after this list.)
  5. What do you need to do to handle a sudden curvature change when fitting a nonlinear spline?
  6. Barring a sudden curvature change, why are knot locations not so important in a restricted cubic spline (rcs)?
  7. What is the statistical term for the dilemma that occurs when choosing the number of knots?
  8. What is a Bayesian way of handling this?
  9. What is the core idea behind Cleveland’s loess smoother?
  10. Can fitted spline functions for some of the predictors be an end in themselves?
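
For reference on questions 3–4 above, this is the usual constructed-variable form of a linearly tail-restricted cubic spline; it is a sketch of the standard formula (often rescaled for numerical stability), not a quotation from the text. With knots \(t_1 < \dots < t_k\), the fit is \(\beta_0 + \beta_1 X_1 + \dots + \beta_{k-1} X_{k-1}\) with \(X_1 = X\) and, for \(j = 1, \dots, k-2\),

\[
X_{j+1} = (X - t_j)_+^3 \;-\; (X - t_{k-1})_+^3\,\frac{t_k - t_j}{t_k - t_{k-1}} \;+\; (X - t_k)_+^3\,\frac{t_{k-1} - t_j}{t_k - t_{k-1}},
\]

where \((u)_+ = \max(u, 0)\). The extra truncated cubic terms are exactly what force the function to be linear beyond the outer knots, which is why the constructed variables look complicated.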

Section 2.5

  1. What is the worst thing about recursive partitioning?
  2. What is a major tradeoff between the modeling strategy we will largely use and ML?
  3. What are some determinants of the best choice between SM and ML?

Section 2.6

  1. What is the main message of the Grambsch and O’Brien paper?
  2. What notational error do many analysts make that leads them to ignore model uncertainty?

Section 2.7.2

  1. Why is it a good idea to pre-specify interactions, and how is this done?
  2. Why do we not fiddle with transformations of a predictor before considering interactions?
  3. What 3-dimensional restriction do the cross-products of two restricted cubic splines force?
  4. What is an efficient and likely-to-fit approach to allowing for the effect of measurement degradation?

Chapter 3

Section 3.1

  1. What is the problem with doing ordinary analysis of data from survey responders?

Section 3.4

  1. What problem is always present when doing complete-case analysis when missing values exist in the data?
  2. What problem is often present?
  3. Why does imputation not help very much when a variable being imputed is a main variable of interest in the analysis?
  4. What is a major reason that adding a new category to a predictor to represent missing values doesn’t work?
  5. What is the primary reason that inserting a constant for missing values of a continuous predictor fails?

Section 3.5

  1. What is a more accurate statement than “imputation boosts the sample size”?
  2. What does single-value fill-in of missings almost always damage?
  3. What is a general way to describe why predictive mean matching (PMM) works?
  4. What is a general advantage of PMM?

Section 3.6

  1. Why does single conditional mean imputation result in biased regression coefficients?

Section 3.8

  1. Why can multiple imputation use Y in predicting X?
  2. What are the sources of uncertainty that a multiple imputation algorithm must take into account for final standard errors to not be underestimated?

Section 3.9

  1. Explain the Yucel-Zaslavsky diagnostic and what it is checking for.

Section 3.10

  1. What is the only reason not to always do 100 or more imputations?

Section 3.11

  1. If using multiple imputation but within a Bayesian framework, what is a major advantage of posterior stacking over what we have been doing in the frequentist domain?

Chapter 4

Section 4.1

  1. An investigator is interested in finding risk factors for stroke. She has collected data on 3000 subjects of whom 70 were known to suffer a stroke within 5 years from study entry. She has decided that race (5 possible levels), age, height, and mean arterial blood pressure should definitely be in the model. She does not try to find previous studies that show how these potential risk factors relate to the risk of stroke. What is one way to decide how many degrees of freedom to spend on each predictor? How can one achieve these d.f. (whatever they are) for race and for a continuous variable? (A fitting sketch follows this list.)
  2. What is the driving force behind the approach used here to prespecify the complexity with which each predictor is specified in the regression model?
  3. Why is the process fair?
  4. When feasible to do, why is learning from a saturated model preferred?
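
As an illustration for question 1 (a sketch under assumed data and variable names, not the book's code), here is one way to give race its full 4 d.f. and each continuous predictor a 3-d.f. natural-spline expansion in Python with statsmodels and patsy:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated stand-in data (3000 subjects, roughly 70 strokes) so the sketch runs;
# a real analysis would use the investigator's data.
rng = np.random.default_rng(0)
n = 3000
df = pd.DataFrame({
    "race": rng.choice(list("ABCDE"), n),
    "age": rng.uniform(40, 85, n),
    "height": rng.normal(170, 10, n),
    "mabp": rng.normal(95, 12, n),       # mean arterial blood pressure
})
lp = -3.9 + 0.04 * (df["age"] - 60) + 0.03 * (df["mabp"] - 95)
df["stroke"] = rng.binomial(1, 1 / (1 + np.exp(-lp)))

# Race keeps its full 4 d.f. (5 categories -> 4 dummies); each continuous
# predictor gets a natural cubic spline basis with 3 d.f. via patsy's cr(),
# comparable to a 4-knot restricted cubic spline.  With only ~70 events the
# total number of parameters would be kept modest in practice.
fit = smf.logit(
    "stroke ~ C(race) + cr(age, df=3) + cr(height, df=3) + cr(mabp, df=3)",
    data=df,
).fit()
print(fit.summary())
```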

Section 4.3

  1. What did we already study that variable selection is an extension of?
  2. What does variable selection ruin more than it ruins regression coefficient estimates?
  3. Besides those problems what’s the worst thing you can say about statistical test-based variable selection?
  4. In which situation would variable selection have a chance to be helpful?
  5. What is the statistical point that the Maxwell demon analogy is trying to bring out?
  6. What is the single root cause of problems that is in common to each of the following strategies?
    • The investigator retains race as 5 categories and fits it with 4 dummy variables. She examines coefficients from these dummy variables to decide which races have similar coefficients and should be pooled.
    • All continuous variables are allowed to be nonlinear using splines, and variables having insignificant nonlinear terms are re-fitted as linear.
    • All potential risk factors are used as candidate predictor variables with forward stepwise variable selection with a significance level for entering the model of \(\alpha=0.1\).
    • A clinical trial of 5 treatments finds that 3 of the treatments result in nearly the same blood pressure reduction. These 3 treatments are pooled and their overall mean is compared with each of the remaining 2 treatments.
  7. Name at least one specific harm done by the strategies outlined in the last question, in general terms (not specific to each strategy).

Section 4.4

  1. Describe a major way that “breaking ties” in Y, i.e., making it more continuous, results in more power and precision in modeling.
  2. What is the effective sample size in a study with 10 cancer outcomes and 100,000 normal outcomes?

Section 4.5

  1. What are main causes of passive shrinkage/regression to the mean?
  2. Why does the heuristic shrinkage estimate work?
  3. What causes the graph of \(\hat{Y}\) (on \(x\)-axis) vs. \(Y\) (on \(y\)-axis; you could call this \(Y_{new}\)) to be flatter than a \(45^\circ\) line when the graph is made for new data not used to fit the model?

Section 4.6

  1. What causes the standard error of a regression coefficient to get large when adjusting for a competing predictor?
  2. When just wanting to measure strength of association or test a group of predictors, what is a simple fix for collinearity?

Section 4.7

  1. What is the purpose of data reduction, and why does it not create phantom degrees of freedom that need to be accounted for later in the modeling process?
  2. Name several data reduction methods for reducing the number of regression coefficients to estimate.
  3. Principal components can have unstable loadings (e.g., if you bootstrap). Why are they still a valuable component in a data reduction strategy?
  4. What is the overall guidance for knowing how much unsupervised learning (data reduction) to do?

Section 4.9

  1. Why are high leverage (far out) observations influential on regression coefficients even when Y is binary?

Section 4.10

  1. What would make one model having much better predictive discrimination than another irrelevant?

Section 4.12

  1. Are the steps to the default strategy complete and is the sequencing “correct”?

Chapter 5

Section 5.1

  1. What is a general way to solve the problem of how to understand a complex model?
  2. Why is it useful to estimate the effect of changing a continuous predictor on the outcome by varying the predictor from its \(25^{th}\) to its \(75^{th}\) percentile?

Section 5.2

  1. Describe briefly in very general terms what the bootstrap involves.
  2. The bootstrap may be used to estimate measures of many types of statistical quantities. Which type is being estimated when it comes to model validation?

Section 5.3

  1. What are some disadvantages of external validation?
  2. What would cause an independently validated \(R^2\) to be negative?
  3. A model to predict the probability that a patient has a disease was developed using patients cared for at hospital A. Sort the following in order from the worst way to validate the predictive accuracy of a model to the best and most stringent way. You can just specify the numbers corresponding to the following, in the right order.
    1. compute the predictive accuracy of the model when applied to a large number of patients in another country
    2. compute the accuracy on a large number of patients in hospital B that is in the same state as hospital A
    3. compute the accuracy on 5 patients from hospital \(A\) that were collected after the original sample used in model fitting was completed
    4. use the bootstrap
    5. randomly split the data into two halves assuming that the dataset is not extremely large; develop the model on one half, freeze it, and evaluate the predictive accuracy on the other half
    6. report how well the model predicted the disease status of the patients used to fit the model
    7. compute the accuracy on patients in hospital \(A\) that were admitted after the last patient’s data used to fit the model were collected, assuming that both samples were large

Section 5.6

  1. What is a way to do a misleading interaction analysis?
  2. What would cause the most embarrassment to a researcher who analyzed a high-dimensional dataset to find “winning” associations?

Chapter 7

Section 7.2

  1. When should one model the time-response profile using discrete time?

Section 7.3

  1. What makes generalized least squares and mixed effect models relatively robust to non-completely-random dropouts?
  2. What does the last observation carried forward method always violate?

Section 7.4

  1. Which correlation structure do you expect to fit the data when there are rapid repetitions over a short time span? When the follow-up time span is very long?

Section 7.8

  1. What can go wrong if many correlation structures are tested in one dataset?
  2. In a longitudinal intervention study, what is the most typical comparison of interest? Is it best to borrow information in estimating this contrast?

Chapter 8

Section 8.2

  1. Critique the choice of the number of parameters devoted to each predictor.

Section 8.3

  1. Explain why the final amount of redundancy can change from the initial assessment.

Section 8.5

  1. What is the meaning of the first canonical variate?
  2. Explain the first row of numbers in the matrix of coefficients of canonical variates.
  3. The transformations in Figure 8.3 are not optimal for predicting the outcome. What good can be said of them?

Section 8.6

  1. Why in general terms do principal components work well in predictive modeling?
  2. What is the advantage of sparse PCs over regular PCs?

Chapter 9

Section 9.1

  1. What does the MLE optimize?
  2. Describe in general terms why the information matrix is related to precision of MLEs.

Section 9.2

  1. Explain in general terms how the Rao efficient score test works.

Section 9.3

  1. For right-censored time-to-event data, what is the likelihood component for an uncensored and for a censored data value? (A sketch of the likelihood follows this list.)
  2. What makes the Wald test not very accurate in general?
  3. Why are Wald confidence intervals symmetric?
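
As a reference point for question 1 above (a standard likelihood construction, not a quotation from the text): with density \(f(t)\) and survival function \(S(t) = 1 - F(t)\), the likelihood for right-censored data is

\[
L = \prod_{i\,:\,\text{event}} f(t_i) \;\prod_{i\,:\,\text{censored}} S(t_i),
\]

i.e., an uncensored observation contributes its density at the observed time, while a right-censored observation contributes only the probability of surviving beyond the censoring time.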

Section 9.4

  1. Why do we like the Newton-Raphson method for solving for MLEs?

Section 9.5

  1. Why does the robust cluster sandwich covariance estimator not make sense at a fundamental level?
  2. Will the cluster sandwich covariance estimator properly detect and ignore observations that are duplicated?

Section 9.7

  1. Why are most bootstrap confidence intervals not extremely accurate (other than from the double bootstrap or the bootstrap t method)?
  2. What is the real appeal of bootstrap confidence intervals?

Section 9.8

  1. Why is AIC not an answer to the question of whether a biomarker is useful to know, after adjusting for other predictors?
  2. What is the appeal of the adequacy index?

Section 9.10

  1. What are the main challenges with penalized MLE?
  2. Can penalization hurt estimates?
  3. What is the most principled method for doing statistical inference if one is penalizing parameter estimates?

Chapter 10

Section 10.1

  1. Why can the relationship between X and the log odds possibly be linear?
  2. Why can the relationship between X and the probability not possibly be linear over a wide range of X if X is powerful?
  3. In the logistic model Prob\([Y=1 | X] = \frac{1}{1 + \exp(-X\beta)} = P\), what is the inverse transformation of \(P\) that “frees” \(X\beta\), both in mathematical form and interpreted using words?
  4. A logistic model is used to relate treatment to the probability of patient response. X is coded 0 for treatment A, 1 for treatment B, and the model is Prob\([Y=1 |\) treatment\(]= \frac{1}{1+\exp[-(\beta_{0} + \beta_{1}X)]}\). What are the interpretations of \(\beta_{0}\) and \(\beta_{1}\) in this model? What is the interpretation of \(\exp(\beta_{1})\)?
  5. What does the estimate of \(\beta\) optimize in the logistic model?
  6. In OLS an \(F\) statistic involves the ratio between the explained sum of squares and an estimate of \(\sigma^2\). The numerator d.f. is the number of parameters being tested, and the denominator d.f. is the d.f. for error. Why do we always use \(\chi^2\) rather than \(F\) statistics with the logistic model? What denominator d.f. could you say a \(\chi^2\) statistic has?
  7. Consider a logistic model logit\((Y=1 | X) = \beta_{0} + \beta_{1}X_{1} + \beta_{2}X_{2}\), where \(X_1\) is binary and \(X_2\) is continuous. List all of the assumptions made by this model.
  8. A binary logistic model containing one variable (treatment, A vs. B) is fitted, resulting in \(\hat{\beta}_{0} = 0, \hat{\beta}_{1} = 1\), where the dummy variable in the model was I[treatment=B]. Compute the estimated B:A odds ratio, the odds of an event for patients on treatment B, and (to two decimal places) the probability of an event for patients on B. (A worked sketch follows this list.)
  9. What is the central appeal of the odds ratio?
  10. Given the absence of interactions in a logistic model, which sort of subjects will show the maximum change in absolute risk when you vary any strong predictor?
  11. Would binary logistic regression be a good method for estimating the probability that a response exceeds a certain threshold?
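
As a minimal worked sketch for questions 3 and 8 above (standard algebra, not quoted from the text): inverting \(P = \frac{1}{1+\exp(-X\beta)}\) gives

\[
X\beta = \log\frac{P}{1-P} = \operatorname{logit}(P),
\]

the log odds of the event. For question 8, \(\hat{\beta}_1 = 1\) gives a B:A odds ratio of \(e^{1} \approx 2.72\); the odds of an event on treatment B are \(e^{\hat{\beta}_0 + \hat{\beta}_1} = e\), so the probability of an event on B is \(e/(1+e) \approx 0.73\).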

Section 10.2

  1. What is the best basis for computing a confidence interval for a risk estimate?
  2. If subject risks are not mostly between 0-0.2 and 0.8-1.0, what is the minimum sample size for fitting a binary logistic model?

Section 10.5

  1. What is the best way to model non-additive effects of two continuous predictors?
  2. The lowess nonparametric smoother, with outlier detection turned off, is an excellent way to depict how a continuous predictor \(X\) relates to a binary response \(Y\) (although splining the predictor in a binary logistic model may perform better). How would one modify the usual lowess plot to result in a graph that assesses the fit of a simple linear logistic model containing only this one predictor \(X\) as a linear effect? (A plotting sketch follows this list.)
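
Here is a minimal sketch of the kind of plot question 2 is describing, using Python's statsmodels on simulated data (an illustration under those assumptions, not code from the text): smooth the binary Y against X with lowess, transform the smoothed probabilities to the logit scale, and overlay the fitted linear logistic model. If the linear-logit model fits, the smoothed curve should be roughly a straight line on this scale.

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.nonparametric.smoothers_lowess import lowess

# Simulated data only so the sketch runs.
rng = np.random.default_rng(0)
n = 1000
x = rng.uniform(-2, 2, n)
y = rng.binomial(1, 1 / (1 + np.exp(-(0.5 + 1.2 * x))))

# Lowess estimate of Prob(Y=1 | X); it=0 turns off the robustness (outlier) iterations.
smoothed = lowess(y, x, frac=0.4, it=0)          # columns: sorted x, smoothed probability
p_hat = np.clip(smoothed[:, 1], 0.01, 0.99)      # clip before taking logits
logit_smooth = np.log(p_hat / (1 - p_hat))

# Simple linear logistic fit for comparison.
fit = sm.Logit(y, sm.add_constant(x)).fit(disp=0)
logit_linear = fit.params[0] + fit.params[1] * smoothed[:, 0]

plt.plot(smoothed[:, 0], logit_smooth, label="logit of lowess estimate")
plt.plot(smoothed[:, 0], logit_linear, linestyle="--", label="linear logistic fit")
plt.xlabel("X"); plt.ylabel("log odds"); plt.legend(); plt.show()
```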

Section 10.8

  1. What are examples of high-information measures of predictive ability?
  2. What is a measure of pure discrimination for the binary logistic model? What commonly used measure in medical diagnosis is this measure equivalent to?
  3. Name something wrong with using the percent of correctly classified observations to quantify the accuracy of a logistic model.

Section 10.9

  1. What is the best way to demonstrate the absolute predictive accuracy of a binary logistic model?

Chapter 12

Section 12.2

  1. Are some of the parametric logistic models we fitted as flexible as the 6-strata loess nonparametric estimates in Figure 12.3?

Section 12.5

  1. From Table 12.4 why did the partial \(\chi^2\) for age not increase very much over casewise deletion?

Chapter 13

Section 13.1

  1. Is a ratio-scaled numeric variable ordinal?

Section 13.2

  1. When using an ordinal regression model, what is the only thing that will change if you transform Y with a monotonically increasing function?

Section 13.3

  1. What is another name for violation of the proportional odds assumption?
  2. Name a statistical test that is a special case of testing hypotheses in the proportional odds model.
  3. What does the Wilcoxon test assume for optimum power?
  4. For a clinical trial, quality of life of patients is coded Y=1, 2, 3, 4 for convenience, corresponding to the categories poor, fair, good, excellent. Write a formal mathematical statement for a model that may be useful for relating treatment and baseline characteristics to Y. (One such statement is sketched after this list.)
  5. A proportional odds model is fit to the above Y with the only predictor being treatment, coded X=0 for control and X=1 for experimental. The model has an intercept of -1 and a regression coefficient \(\beta_{1} = \log(2)\). How many different odds ratios can you get out of this model? What are their values? What are their interpretations? What does the model assume with respect to the subject matter, in the most general terms?
  6. Describe at least one approach to assessing the proportional odds assumption for this model.
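
One common way to write such a model for question 4 above is the proportional odds form (a sketch using standard notation, not a quotation from the text): for \(j = 2, 3, 4\),

\[
\Pr(Y \ge j \mid X) = \frac{1}{1 + \exp[-(\alpha_j + X\beta)]},
\]

with a separate intercept \(\alpha_j\) for each cutoff of Y but a single vector of slopes \(\beta\) shared across cutoffs. In question 5 this form implies one common treatment odds ratio, \(\exp(\beta_1) = 2\), applying to \(Y \ge j\) for every cutoff \(j\).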

Section 13.4

  1. What is another name for the continuation ratio ordinal logistic model?

Chapter 15

Section 15-15.2

  1. In what ways are parametric robust regression models and generalized additive models not robust, or arbitrary?

Section 15.3

  1. What is the semiparametric regression equivalent to the Q-Q plot?
  2. Which estimand from a semiparametric ordinal model depends on the transformation of Y?

Section 15.5

  1. Why does a parametric regression model make a linearity assumption on the appropriately transformed cumulative distribution of Y?
  2. What is the parallelism assumption for the normal linear model?

Section 15.6

  1. What are two major appeals of ordinal regression?
  2. What can make predicted quantiles from ordinal models not as accurate as quantile regression?
  3. What is the connection between the mean and median for a log-normal distribution?
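
For question 3, a standard fact (not quoted from the text): if \(\log Y \sim N(\mu, \sigma^2)\), then

\[
\operatorname{median}(Y) = e^{\mu}, \qquad E[Y] = e^{\mu + \sigma^{2}/2},
\]

so the mean exceeds the median by the factor \(e^{\sigma^{2}/2}\).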

Chapter 16

Section 16.2

  1. In what way are generalized additive models not flexible or robust?

Section 16.3

  1. In what way is estimation of the Y transformation especially honest?

Section 16.4

  1. Suppose that one knew that analyzing Y after log transformation is appropriate. Why might one use the smearing estimator to estimate the mean of Y on the original scale and not use the MLE that comes from the log-normal model?

Chapter 17

Section 17.5

  1. Which nonparametric estimator of S(t), the Kaplan-Meier or the Altschuler-Nelson estimator, is better?

Section 17.6

  1. For estimating which quantity in competing risk analysis is it OK to just censor on events other than the event being considered?
  2. When there are correlated events for a multiple time-to-event analysis, which estimator is perhaps the most interpretable?

Chapter 18

Section 18.1

  1. Which estimator gives us an important clue about what defines information content in time-to-event modeling, and how does it do this?

Section 18.3

  1. Which parametric survival model intersects the accelerated failure time and proportional hazards model classes?

Section 18.7

  1. What are two weaknesses of the c-index?

Section 18.8

  1. What is a simpler approach to analyzing time-dependent covariates that do not change continuously in time?

Chapter 19