# Statistical Thinking

## Posts

### Wedding Bayesian and Frequentist Designs Created a Mess

This article is about a real example in which creation of a hybrid Bayesian-frequentist RCT design created an analytical mess.

### Ordinal Models for Paired Data

This article briefly discusses why the rank difference test is better than the Wilcoxon signed-rank test for paired data, then shows how to generalize the rank difference test using the proportional odds ordinal logistic semiparametric regression model. To make the regression model work for non-independent (paired) measurements, the robust cluster sandwich covariance estimator is used for the log odds ratio. Power and type I assertion \(\alpha\) probabilities are compared with the paired \(t\)-test for \(n=25\). The ordinal model yields \(\alpha=0.05\) under the null and has power that is virtually as good as the optimum paired \(t\)-test. For non-normal data the ordinal model power exceeds that of the parametric test.

### Randomized Clinical Trials Do Not Mimic Clinical Practice, Thank Goodness

Randomized clinical trials are successful because they do not mimic clinical practice. They remain highly clinically relevant despite this.

### Biostatistical Modeling Plan

This is an example statistical plan for project proposals where the goal is to develop a biostatistical model for prediction, and to do external or strong internal validation of the model.

### How to Do Bad Biomarker Research

This article covers some of the bad statistical practices that have crept into biomarker research, including setting the bar too low for demonstrating that biomarker information is new, believing that winning biomarkers are really “winners”, and improper use of continuous variables. Step-by-step guidance is given for ensuring that a biomarker analysis is not reproducible and does not provide clinically useful information.

### R Workflow

An overview of R Workflow, which covers how to use R effectively all the way from importing data to analysis, and making use of `Quarto` for reproducible reporting.

### Decision curve analysis for quantifying the additional benefit of a new marker

This article examines the benefits of decision curve analysis for assessing model performance when adding a new marker to an existing model. Decision curve analysis provides a clinically interpretable metric based on the number of events identified and interventions avoided.
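As a rough illustration of the metric behind a decision curve, the sketch below computes net benefit at a risk threshold using the standard definition: true positives per patient minus false positives per patient, with false positives weighted by the odds of the threshold. The function name `net_benefit` and the toy outcomes and risks are hypothetical, chosen only for illustration; they are not taken from the article.

```python
def net_benefit(y_true, risk, threshold):
    """Net benefit of intervening on patients whose predicted risk >= threshold."""
    if not 0 < threshold < 1:
        raise ValueError("threshold must be strictly between 0 and 1")
    n = len(y_true)
    # Patients flagged for intervention who do / do not have the event
    tp = sum(1 for y, p in zip(y_true, risk) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, risk) if p >= threshold and y == 0)
    # False positives are weighted by the odds of the threshold probability
    return tp / n - fp / n * threshold / (1 - threshold)


# Hypothetical outcomes and predicted risks, for illustration only
y = [1, 0, 1, 0, 0, 1, 0, 0]
base = [0.60, 0.40, 0.30, 0.20, 0.10, 0.70, 0.50, 0.20]         # existing model
with_marker = [0.70, 0.30, 0.50, 0.10, 0.10, 0.80, 0.30, 0.20]  # model plus new marker

for pt in (0.2, 0.4):
    print(pt, net_benefit(y, base, pt), net_benefit(y, with_marker, pt))
```

Evaluating both models over a range of thresholds and plotting the results gives the decision curve; the difference between the two curves at a clinically sensible threshold is one way to express the added value of the marker.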

### Equivalence of Wilcoxon Statistic and Proportional Odds Model

In this article I provide much more extensive simulations showing the near perfect agreement between the odds ratio (OR) from a proportional odds (PO) model, and the Wilcoxon two-sample test statistic. The agreement is studied by degree of violation of the PO assumption and by the sample size. A refinement in the conversion formula between the OR and the Wilcoxon statistic scaled to 0-1 (concordance probability) is provided.
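For readers unfamiliar with the 0-1 scaling, the following minimal sketch computes the concordance probability directly from two samples: the proportion of all cross-group pairs in which the second group's value is larger, with ties counted as one half. This is the Wilcoxon-Mann-Whitney statistic divided by the number of pairs. The function name and the toy data are assumptions for illustration, and the article's conversion formula from the OR is not reproduced here.

```python
def concordance_probability(x, y):
    """Probability that a random value from y exceeds one from x; ties count 1/2."""
    wins = 0.0
    for yi in y:
        for xi in x:
            if yi > xi:
                wins += 1.0
            elif yi == xi:
                wins += 0.5
    return wins / (len(x) * len(y))


# Hypothetical ordinal responses in two groups, for illustration only
control = [1, 2, 2, 3, 4]
treated = [2, 3, 4, 4, 5]
c = concordance_probability(control, treated)
print(c)
```

A value of 0.5 indicates no tendency for either group to have larger responses, and swapping the two groups gives one minus the original value.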

### Longitudinal Data: Think Serial Correlation First, Random Effects Second

Most analysts automatically turn towards random effects models when analyzing longitudinal data. This may not always be the most natural, or best fitting approach.

### Assessing the Proportional Odds Assumption and Its Impact

This article demonstrates how the proportional odds (PO) assumption and its impact can be assessed. General robustness to non-PO on either a main variable of interest or on an adjustment covariate is exemplified. Advantages of a continuous Bayesian blend of PO and non-PO are also discussed.

### Commentary on Improving Precision and Power in Randomized Trials for COVID-19 Treatments Using Covariate Adjustment, for Binary, Ordinal, and Time-to-Event Outcomes

This is a commentary on the paper by Benkeser, Díaz, Luedtke, Segal, Scharfstein, and Rosenblum.

### Incorrect Covariate Adjustment May Be More Correct than Adjusted Marginal Estimates

This article provides a demonstration that the perceived non-robustness of nonlinear models for covariate adjustment in randomized trials may be less of an issue than the non-transportability of marginal so-called robust estimators.

### Avoiding One-Number Summaries of Treatment Effects for RCTs with Binary Outcomes

This article presents an argument that for RCTs with a binary outcome the primary result should be a distribution and not any single number summary. The GUSTO-I study is used to exemplify risk difference distributions.

### If You Like the Wilcoxon Test You Must Like the Proportional Odds Model

Since the Wilcoxon test is a special case of the proportional odds (PO) model, if one likes the Wilcoxon test, one must like the PO model. This is made more convincing by showing examples of how one may accurately compute the Wilcoxon statistic from the PO model’s odds ratio.

### Violation of Proportional Odds is Not Fatal

Many researchers worry about violations of the proportional hazards assumption when comparing treatments in a randomized study. Besides the fact that this frequently makes them turn to a much worse approach, the harm done by violations of the proportional odds assumption usually does not prevent the proportional odds model from providing a reasonable treatment effect assessment.

### RCT Analyses With Covariate Adjustment

This article summarizes arguments for the claim that the primary analysis of treatment effect in a RCT should be with adjustment for baseline covariates. It reiterates some findings and statements from classic papers, with illustration on the GUSTO-I trial.

### Bayesian Methods to Address Clinical Development Challenges for COVID-19 Drugs and Biologics

The COVID-19 pandemic has elevated the challenge for designing and executing clinical trials with vaccines and drug/device combinations within a substantially shortened time frame. Numerous challenges in designing COVID-19 trials include lack of prior data for candidate interventions / vaccines due to the novelty of the disease, evolving standard of care and sense of urgency to speed up development programmes. We propose sequential and adaptive Bayesian trial designs to help address the challenges inherent in COVID-19 trials. In the Bayesian framework, several methodologies can be implemented to address the complexity of the primary endpoint choice. Different options could be used for the primary analysis of the WHO Severity Scale, frequently used in COVID-19 trials. We propose the longitudinal proportional odds mixed effects model using the WHO Severity Scale ordinal scale. This enables efficient utilization of all clinical information to optimize sample sizes and maximize the rate of acquiring evidence about treatment effects and harms.

### Implications of Interactions in Treatment Comparisons

This article explains how the generalizability of randomized trial findings depends primarily on whether and how patient characteristics modify (interact with) the treatment effect. For an observational study this will be related to overlap in the propensity to receive treatment.

### The Burden of Demonstrating HTE

Reasons are given for why heterogeneity of treatment effect must be demonstrated, not assumed. An example is presented that shows that HTE must exceed a certain level before personalizing treatment results in better decisions than using the average treatment effect for everyone.

### Assessing Heterogeneity of Treatment Effect, Estimating Patient-Specific Efficacy, and Studying Variation in Odds ratios, Risk Ratios, and Risk Differences

This article shows an example formally testing for heterogeneity of treatment effect in the GUSTO-I trial, shows how to use penalized estimation to obtain patient-specific efficacy, and studies variation across patients in three measures of treatment effect.

### Statistically Efficient Ways to Quantify Added Predictive Value of New Measurements

Researchers have used contorted, inefficient, and arbitrary analyses to demonstrate added value in biomarkers, genes, and new lab measurements. Traditional statistical measures have always been up to the task, and are more powerful and more flexible. It’s time to revisit them, and to add a few slight twists to make them more helpful.

### In Machine Learning Predictions for Health Care the Confusion Matrix is a Matrix of Confusion

The performance metrics chosen for prediction tools, and for Machine Learning in particular, have significant implications for health care and a penetrating understanding of the AUROC will lead to better methods, greater ML value, and ultimately, benefit patients.

### Viewpoints on Heterogeneity of Treatment Effect and Precision Medicine

This article provides my reflections after the PCORI/PACE Evidence and the Individual Patient meeting on 2018-05-31. The discussion includes a high-level view of heterogeneity of treatment effect in optimizing treatment for individual patients.

### Musings on Multiple Endpoints in RCTs

This article discusses issues related to alpha spending, effect sizes used in power calculations, multiple endpoints in RCTs, and endpoint labeling. Changes in endpoint priority are addressed. Included in the discussion is how Bayesian probabilities more naturally allow one to answer multiple questions without all-too-arbitrary designations of endpoints as “primary” and “secondary”. And we should not quit trying to learn.

### Improving Research Through Safer Learning from Data

What are the major elements of learning from data that should inform the research process? How can we prevent having false confidence from statistical analysis? Does a Bayesian approach result in more honest answers to research questions? Is learning inherently subjective anyway, so we need to stop criticizing Bayesians’ subjectivity? How important and possible is pre-specification? When should replication be required? These and other questions are discussed.

### Is Medicine Mesmerized by Machine Learning?

Deep learning and other forms of machine learning are getting a lot of press in medicine. The reality doesn’t match the hype, and interpretable statistical models still have a lot to offer.

### Information Gain From Using Ordinal Instead of Binary Outcomes

This article gives examples of information gained by using ordinal over binary response variables. This is done by showing that for the same sample size and power, smaller effects can be detected.

### How Can Machine Learning be Reliable When the Sample is Adequate for Only One Feature?

It is easy to compute the sample size \(N_{1}\) needed to reliably estimate how one predictor relates to an outcome. It is next to impossible for a machine learning algorithm entertaining hundreds of features to yield reliable answers when the sample size is \(< N_{1}\).

### Statistical Criticism is Easy; I Need to Remember That Real People are Involved

Criticism of medical journal articles is easy. I need to keep in mind that much good research is done even if there are some flaws in the design, analysis, or interpretation. I also need to remember that real people are involved.

### Continuous Learning from Data: No Multiplicities from Computing and Using Bayesian Posterior Probabilities as Often as Desired

This article describes the drastically different way that sequential data looks operate in a Bayesian setting compared to a classical frequentist setting.
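To make the idea of repeated looks concrete, here is a minimal sketch that recomputes a posterior probability of benefit at each look using the conjugate normal model with known standard deviation. The prior, the value of `sigma`, and the running means are hypothetical assumptions chosen for illustration; the article's own simulations are not reproduced.

```python
from statistics import NormalDist


def posterior_prob_benefit(ybar, n, sigma, prior_sd):
    """P(mu > 0 | data) for a normal mean with known sigma and a N(0, prior_sd^2) prior."""
    prior_precision = 1.0 / prior_sd ** 2
    data_precision = n / sigma ** 2
    post_var = 1.0 / (prior_precision + data_precision)
    # The prior mean is 0, so only the data contribute to the posterior mean
    post_mean = post_var * data_precision * ybar
    return 1.0 - NormalDist(post_mean, post_var ** 0.5).cdf(0.0)


# Hypothetical (cumulative n, running mean) pairs at successive data looks
looks = [(10, 0.10), (20, 0.25), (40, 0.30), (80, 0.32)]
for n, ybar in looks:
    p = posterior_prob_benefit(ybar, n, sigma=1.0, prior_sd=0.5)
    print(f"n={n:3d}  mean={ybar:.2f}  P(benefit)={p:.3f}")
```

The same calculation is applied unchanged at every look: each posterior probability conditions on the data accrued so far, so the number of earlier looks does not enter the formula.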

### Bayesian vs. Frequentist Statements About Treatment Efficacy

This article contrasts language used when reporting a classical frequentist treatment comparison vs. a Bayesian one, and describes why Bayesian statements convey more actionable information.

### Integrating Audio, Video, and Discussion Boards with Course Notes

In this article I seek recommendations for integrating various media for teaching long courses.

### EHRs and RCTs: Outcome Prediction vs. Optimal Treatment Selection

Observational data from electronic health records may contain biases that large sample sizes do not overcome. Moderate confounding by indication may render an infinitely large observational study less useful than a small randomized trial for estimating relative treatment effectiveness.

### Statistical Errors in the Medical Literature

This article catalogs several types of statistical problems that occur frequently in medical journal articles.

### Subjective Ranking of Quality of Research by Subject Matter Area

This is a subjective ranking of topical areas by the typical quality of research published in the area. Keep in mind that top-quality research can occur in any area when the research team is multi-disciplinary, team members are at the top of their game, and peer review is functional.

### Damage Caused by Classification Accuracy and Other Discontinuous Improper Accuracy Scoring Rules

Estimating tendencies is usually a more appropriate goal than classification, and classification leads to the use of discontinuous accuracy scores which give rise to misleading results.

### My Journey from Frequentist to Bayesian Statistics

This is the story of what influenced me to become a Bayesian statistician after being trained as a classical frequentist statistician, and practicing only that mode of statistics for many years.

### A Litany of Problems With p-values

p-values are very often misinterpreted. p-values and null hypothesis significance testing have hurt science. This article attempts to catalog all the ways in which these happen.

### Clinicians’ Misunderstanding of Probabilities Makes Them Like Backwards Probabilities Such As Sensitivity, Specificity, and Type I Error

The error of the transposed conditional is rampant in research. Conditioning on what is unknowable to predict what is already known leads to a host of complexities and interpretation problems.

### Split-Sample Model Validation

The many disadvantages of split-sample validation, including subtle ones, are discussed.

### Classification vs. Prediction

Classification involves a forced-choice premature decision, and is often misused in machine learning applications. Probability modeling involves the quantification of *tendencies* and usually addresses the real project goals.

### Null Hypothesis Significance Testing Never Worked

This article explains why for decision making the original idea of null hypothesis testing never delivered on its goal.

### p-values and Type I Errors are Not the Probabilities We Need

p-values are not what decision makers need, nor are they what most decision makers think they are getting.

## Talks

### My Big Jump: Founding a Department of Biostatistics

For many years biostatistics had been successful at Vanderbilt, but the opportunity to create a department home for biostatistics was too good to pass up. The new department and its support from the School of Medicine leadership made it an attractive place for recruiting new faculty and staff. This talk will cover what made the department attractive, as well as principles upon which the Department of Biostatistics was founded in the Vanderbilt School of Medicine in 2003. These principles include reproducible research and prioritizing collaboration over consultation. Challenges and opportunities of running the department in a growing academic medical center will be discussed, with emphasis on generalizable knowledge that may assist others in starting, sustaining, and enhancing biostatistics groups in their own medical centers.

### Controversies in Predictive Modeling, Machine Learning, and Validation

This talk covers a variety of controversial and/or current issues related to statistical modeling and prediction research. Some of the topics covered are why external validation is often not a good idea, why validating researchers is often more efficient than validating models, what distinguishes statistical models from machine learning, how variable selection only gives the illusion of learning from data, and advantages of older measures of model performance.

### R Workflow for Reproducible Biomedical Research Using Quarto

This work is intended to foster best practices in reproducible data documentation and manipulation, statistical analysis, graphics, and reporting. It will enable the reader to efficiently produce attractive, readable, and reproducible research reports while keeping code concise and clear. Readers are also guided in choosing statistically efficient descriptive analyses that are consonant with the type of data being analyzed.

### Longitudinal Ordinal Models as a General Framework for Medical Outcomes

Univariate ordinal models can be used to model a wide variety of longitudinal outcomes, using only standard software, through the use of Markov processes. This talk will show how longitudinal ordinal models unify a wide variety of types of analyses including time to event, recurrent events, continuous responses interrupted by events, and multiple events that are capable of being placed in a hierarchy. Through the use of marginalization over the previous state in an ordinal multi-state transition model, one may obtain virtually any estimand of interest. Both frequentist and Bayesian methods can be used to fit the model and draw inferences.

### Modernizing Clinical Trial Design and Analysis to Improve Efficiency & Flexibility

This presentation covers several ways to make clinical trials more efficient and to reduce the chance of ending with an equivocal result. Some of the approaches covered are Bayesian sequential designs allowing for study extension if results are promising, not being tied by type I assertion probabilities/α spending, using high-information longitudinal ordinal outcomes, and covariate adjustment.

### Musings on Statistical Models vs. Machine Learning in Health Research

Health researchers and practicing clinicians are with increasing frequency hearing about machine learning (ML) and artificial intelligence applications. They, along with many statisticians, are unsure of when to use traditional statistical models (SM) as opposed to ML to solve analytical problems related to diagnosis, prognosis, treatment selection, and health outcomes. And many advocates of ML do not know enough about SM to be able to appropriately compare performance of SM and ML. ML experts are particularly prone to not grasp the impact of the choice of measures of predictive performance. In this talk I attempt to define what makes ML distinct from SM, and to define the characteristics of applications for which ML is likely to offer advantages over SM, and vice-versa. The talk will also touch on the vast difference between prediction and classification and how this leads to many misunderstandings in the ML world. Other topics to be covered include the minimum sample size needed for ML, and problems ML algorithms have with absolute predictive accuracy (calibration).

### Sequential Bayesian Designs for Rapid Learning in COVID-19 Clinical Trials

Continuous learning from data and computation of probabilities that are directly applicable to decision making in the face of uncertainty are hallmarks of the Bayesian approach. Bayesian sequential designs are the simplest of flexible designs, and continuous learning capitalizes on their efficiency, resulting in lower expected sample sizes until sufficient evidence is accrued due to the ability to take unlimited data looks. Classical null hypothesis testing only provides evidence against the supposition that a treatment has exactly zero effect, and it requires one to deal with complexities if not doing the analysis at a single fixed time. Bayesian posterior probabilities, on the other hand, can be computed at any point in the trial and provide current evidence about all possible questions, such as benefit, clinically relevant benefit, harm, and similarity of treatments.

Besides requiring flexibility in a rapidly changing environment, COVID-19 trials often use ordinal endpoints and standard statistical models such as the proportional odds (PO) model. Less standard is how to model serial ordinal responses. Methods and new Bayesian software have been developed for COVID-19 trials. Also implemented is a Bayesian partial PO model (Peterson and Harrell, 1990) that allows one to put a prior on the degree to which a treatment affects mortality differently than how it affects other components of the ordinal scale. These ordinal models will be briefly discussed.

### Bayes for Flexibility in Urgent Times

Continuous learning from data and computation of probabilities that are directly applicable to decision making in the face of uncertainty are hallmarks of the Bayesian approach. Bayesian sequential designs are the simplest of flexible designs, and continuous learning capitalizes on their efficiency, resulting in lower expected sample sizes until sufficient evidence is accrued due to the ability to take unlimited data looks. Classical null hypothesis testing only provides evidence against the supposition that a treatment has exactly zero effect, and it requires one to deal with complexities if not doing the analysis at a single fixed time. Bayesian posterior probabilities, on the other hand, can be computed at any point in the trial and provide current evidence about all possible questions, such as benefit, clinically relevant benefit, harm, and similarity of treatments.

### Fundamental Advantages of Bayes in Drug Development

This presentation covers the limitations of frequentist inference for answering clinical questions and generating evidence for efficacy. Key to understanding efficacy is understanding conditional probability and its relation to information flow. What type I error really controls is discussed, and it is argued that it is not regulator’s regret. The frequentist and Bayesian approaches for stating statistical results for efficacy assessment are contrasted, and a high-level view of the Bayesian approach is given. A key point is the actionability of the statistical results. Some of the advantages of the Bayesian approach are cataloged, with emphasis on forward-information-flow probabilities that instantly define their own error probabilities. Multiplicity non-issues are discussed.

### R for Graphical Clinical Trial Reporting

For clinical trials a good deal of effort goes into producing both final trial reports and interim reports for data monitoring committees, and experience has shown that reviewers much prefer graphical to tabular reports. Interactive graphical reports go a step further and allow the most important information to be presented by default, while inviting the reviewer to drill down to see other details. The drill-down capability, implemented by hover text using the R `plotly` package, allows one to almost entirely dispense with tables because the hover text can contain the part of a table that pertains to the reviewer’s current focal point in the graphical display, among other things. Also, there are major efficiency gains by having a high-level language for producing common elements of reports related to accrual, exclusions, descriptive statistics, adverse events, time to event, and longitudinal data. This talk will overview the `hreport` package, which relies on R, `RMarkdown`, `knitr`, `plotly`, `Hmisc`, and `HTML5`. `RStudio` is an ideal report development environment for using these tools.

### Why Bayes for Clinical Trials?

This presentation covers the limitations of frequentist inference for answering clinical questions and generating evidence for efficacy. Key to understanding efficacy is understanding conditional probability and its relation to information flow. What type I error really controls is discussed, and it is argued that it is not regulator’s regret. The frequentist and Bayesian approaches for stating statistical results for efficacy assessment are contrasted, and a high-level view of the Bayesian approach is given. A key point is the actionability of the statistical results. Some of the advantages of the Bayesian approach are cataloged, with emphasis on forward-information-flow probabilities that instantly define their own error probabilities. Multiplicity issues are discussed, and a simple simulation study is used to demonstrate the lack of multiplicity issues in the Bayesian context even with infinitely many data looks. Some practical guidance for choosing prior distributions is given. Finally, some examples of joint Bayesian inference for multiple endpoints are given.

### R for Clinical Trial Reporting

Statisticians and statistical programmers spend a great deal of time analyzing data and producing reports for clinical trials, both for final trial reports and for interim reports for data monitoring committees. Point and Click interfaces and copy-and-paste are now believed to be bad models for reproducible research. Instead, there are advantages to developing a high-level language for producing common elements of reports related to accrual, exclusions, descriptive statistics, adverse events, time to event, and longitudinal data.

It is well appreciated in the statistical and graphics design communities that graphics are much better than tables for conveying numeric information. There are thus advantages for having statistical reports for clinical trials that are almost completely graphical. Instead of devoting space to tables, `HTML5` and `Javascript` in R html reports make it easy to show tabular information in pop-up text when hovering the mouse over a graphical element.

In this talk I will describe R packages `greport` (using a \(\LaTeX\) pdf model) and `hreport` (using an html model). `knitr` and `Rmarkdown` are used to compose the reproducible reports. `greport` and `hreport` compose all figure and table captions. They contain high-level abstractions of common clinical trial reporting tasks to minimize programming by the user. Before showing examples of these report-making packages, I’ll show some of the new graphical building blocks in the `Hmisc` and `rms` packages. These new functions make use of the `plotly` package to create interactive graphics using `Javascript` and `D3`.

### Regression Modeling Strategies

Short course

### Simple Bootstrap and Simulation Approaches to Quantifying Reliability of High-Dimensional Feature Selection

Feature selection in the large p non-large n case is known to be unreliable, but most biomedical researchers are not aware of the magnitude of the problem. They assume for example that setting a false discovery rate makes the results reliable, forgetting about the false negative rate and decades of research showing unreliability of stepwise variable selection even in the low p case. A related problem is the unreliability in the estimate of the effect (e.g., an odds ratio) of a feature found by selecting ‘winners’. This talk will demonstrate some simple bootstrap and Monte Carlo simulation procedures for teaching biomedical researchers how to quantify these problems. One of the bootstrap examples exposes the difficulty of the task by computing confidence intervals for importance rankings of features.
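One way such a bootstrap might look is sketched below: rows are resampled with replacement, features are re-ranked in each resample, and percentile intervals are taken over the ranks each feature receives. Ranking by absolute marginal correlation, the simulated data, and all function names are assumptions made for this illustration, not the procedure used in the talk.

```python
import random


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either variable is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def ranks_by_importance(X, y):
    """Rank features by |correlation with y|; rank 1 = apparently most important."""
    p = len(X[0])
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(p)]
    order = sorted(range(p), key=lambda j: -scores[j])
    ranks = [0] * p
    for r, j in enumerate(order, start=1):
        ranks[j] = r
    return ranks


def bootstrap_rank_intervals(X, y, n_boot=500, level=0.95, seed=1):
    """Approximate percentile intervals for each feature's importance rank."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    draws = [[] for _ in range(p)]
    for _ in range(n_boot):
        # Resample whole rows so each y stays paired with its feature vector
        idx = [rng.randrange(n) for _ in range(n)]
        r = ranks_by_importance([X[i] for i in idx], [y[i] for i in idx])
        for j in range(p):
            draws[j].append(r[j])
    lo_k = int((1 - level) / 2 * n_boot)
    hi_k = int((1 + level) / 2 * n_boot) - 1
    intervals = []
    for j in range(p):
        s = sorted(draws[j])
        intervals.append((s[lo_k], s[hi_k]))
    return intervals


# Simulated data (hypothetical): 50 subjects, 20 features, only feature 0 matters
rng = random.Random(0)
n, p = 50, 20
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [0.5 * row[0] + rng.gauss(0, 1) for row in X]

intervals = bootstrap_rank_intervals(X, y)
for j, (lo, hi) in enumerate(intervals[:5]):
    print(f"feature {j}: rank interval [{lo}, {hi}]")
```

Wide intervals for the ranks, including those of apparent 'winners', are a direct display of how little the data can say about which features are truly most important.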

### Using R, Rmarkdown, RStudio, knitr, plotly, and HTML for the Next Generation of Reproducible Statistical Reports

The Vanderbilt Department of Biostatistics has two policies currently in effect:

1. All statistical reports will be reproducible

2. All reports should include all the code used to produce the report, in some fashion

We have succeeded with 1. (mainly using knitr in R) and to a large extent with 2. Some biostatisticians have been concerned about interspersing code with the contents of the report. It has also been challenging to copy some PDF report components (e.g., advanced tables) into word processing documents.

Fortunately R and RStudio have recently added a number of new features that allow for easy creation of HTML notebooks that are viewed with any web browser. This solves the problems listed above and adds new possibilities such as interactive graphics that appear in a self-contained HTML file to post on a collaboration web server or send to a collaborator. Interactive graphics allow the analyst to create more detail (e.g., confidence bands for multiple confidence levels; confidence bands for group differences as well as those for each group individually) with the collaborator able to easily select which details to view.

I have made major revisions in the R Hmisc and rms packages to provide new capabilities that fit into the R/RStudio Rmarkdown HTML notebook framework. Interactive plotly graphics (based on Javascript and D3) and customized HTML output are the main new ingredients. In this talk the rationale for this approach is discussed, and the new features are demonstrated with two statistical reports. A few miscellaneous topics will also be covered, e.g. how to cite bibliographic references in Rmarkdown and how to interface R to citeulike.org for viewing or extracting bibliographic references.

For more information see

https://www.r-project.org

https://www.rstudio.com

http://rmarkdown.rstudio.com

http://rmarkdown.rstudio.com/r_notebooks.html

http://yihui.name/knitr

https://hbiostat.org/R/Hmisc

https://plot.ly/r

https://plot.ly/r/getting-started

ggplotly: a function that converts any ggplot2 graphic to a plotly interactive graphic: https://plot.ly/ggplot2

### Exploratory Analysis of Clinical Safety Data to Detect Safety Signals

It is difficult to design a clinical study to provide sound inferences about safety effects of drugs in addition to providing trustworthy evidence for efficacy. Patient entry criteria and experimental design are targeted at efficacy, and there are too many possible safety endpoints to be able to control type I error while preserving power. Safety analysis tends to be somewhat ad hoc and exploratory. But with the large quantity of safety data acquired during clinical drug testing, safety data are rarely harvested to their fullest potential. Also, decisions are sometimes made that result in analyses that are somewhat arbitrary or that lose statistical efficiency. For example, safety assessments can be too quick to rely on the proportion of patients in each treatment group at each clinic visit who have a lab measurement above two or three times the upper limit of normal.

Safety reports frequently fail to fully explore areas such as

• which types of patients are having AEs?

• what distortions in the tails of the distribution of lab values are taking place?

• which AEs tend to occur in the same patient?

• how do clinical AEs correlate with continuous lab measurements at a given time?

• which AEs and lab abnormalities are uniquely related to treatment assigned?

• do preclinically significant measurements at an earlier visit predict AEs at a later visit?

• how can time trends in many variables be digested into an understandable picture?

This talk will demonstrate some of the exploratory statistical and graphical methods that can help answer questions such as the above, using examples based on data from real pharmaceutical trials.