Overview
Design
Results entry at time of analysis
- Timing of analyses and resulting event counts and spending times
- Nominal p-values for each analysis
Testing hypotheses
- Compute sequential p-values for each hypothesis
- Evaluate hypothesis rejection using gMCP
Verification of hypotheses rejected
Session information
References

Overview

This document is intended to evaluate statistical significance for graphical multiplicity control when used with group sequential design (Maurer and Bretz 2013). In particular, we demonstrate design and analysis of a complex oncology trial. There are many critical details building on the necessarily simple example provided by Maurer and Bretz (2013). The combination of tools provided by the gMCP and gsDesign packages is non-trivial, but developed in a way that is meant to generalize easily. This has been found to be particularly valuable to provide a prompt and verifiable conclusion in multiple trials such as Burtness et al. (2019) where 14 hypotheses were evaluated using a template such as this.

Given the complexity involved, substantial effort has been taken to provide methods to check hypothesis testing.

The initial testing is done by using sequential p-values (Liu and Anderson 2008) which can then be plugged into standard graphical testing software (Bretz, Maurer, and Posch 2009).
The graphical testing produces a sequence of updated multiplicity graphs, each with a single hypothesis rejected from the previous graph.
The final graph, assuming not all hypotheses were rejected, provides the final Type I error available for testing each hypothesis that was not rejected.
Updated group sequential bounds for each hypothesis can be checked vs. nominal p-values at each analysis to verify the testing conclusions reached with the above methods.

The table of contents above lays out the organization of the document. In short, we begin with 1) design specification followed by 2) results entry which includes event counts and nominal p-values for testing, 3) carrying out hypothesis testing, and finishing with 4) verification of the hypothesis testing results.

Design

There are 3 endpoints and 2 populations resulting in 6 hypotheses to be tested in the trial. The endpoints are:

Overall survival (OS)
Progression free survival (PFS)
Objective response rate (ORR)

The populations to be studied are:

The overall population (All subjects)
A subgroup (Subgroup)

For simplicity, we design assuming the control group has an exponential time to event with a median of 12 months for OS and 5 months for PFS. We design under a proportional hazards assumption. ORR for the control group is assumed to be 15%. Some of the choices here are arbitrary, but the intent is to fully specify how patients will be enrolled and followed for \(\alpha\)-controlled study analyses.
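The exponential assumption ties each control median to a constant hazard rate, which is the form passed as lambdaC in the design code later in this document. The following is a minimal base R sketch of that conversion; the names medianOS, medianPFS, and survAt are illustrative and do not appear in the original.

```r
# Control-arm medians assumed in the design (months)
medianOS <- 12
medianPFS <- 5

# Exponential model: S(t) = exp(-lambda * t), so S(median) = 0.5
lambdaOS <- log(2) / medianOS
lambdaPFS <- log(2) / medianPFS

# Illustrative helper: survival probability at time t for hazard rate lambda
survAt <- function(t, lambda) exp(-lambda * t)

# Fraction of control subjects still event-free at selected follow-up times
round(survAt(c(6, 12, 24), lambdaOS), 3)
round(survAt(c(5, 14), lambdaPFS), 3)
```

Under these assumptions, half of control subjects are event-free at the median and a quarter at twice the median.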

The following design characteristics are also specified to well-characterize outcomes for all subjects by the end of the trial:

Enrollment is assumed to occur over 18 months. Enrollment will continue until the targeted number of subjects has been enrolled in the subgroup to ensure power as planned for that population. This means, the overall population sample size will be random and power may vary from that planned here.
The first interim analysis will be conducted 6 months after the final patient is enrolled to adequately assess ORR for all patients. Thus, the analysis is planned at 24 months after start of study enrollment, but will be adapted according to when final enrollment is completed. This is the only analysis for ORR and is an interim analysis for PFS and OS with whatever event counts are available at the cutoff.
The second interim analysis will be conducted 14 months after final enrollment to ensure minimum follow-up of almost 3 times the assumed control median PFS for all subjects. This would be delayed up to 3 months if the final targeted event count for PFS in the subgroup is not achieved at that time. This is to ensure a complete description of tail behavior for PFS in case a PFS curve has a plateau. PFS and OS will be analyzed. The event counts for OS and for the overall population for PFS are random since the cutoff is determined by the PFS event count for the subgroup.
The final analysis will be performed 24 months after final enrollment, ensuring 2 times the median control survival as minimum follow-up for all subjects. Only analysis of OS is planned. The final analysis may be delayed up to 6 months if the targeted OS event count in the subgroup is not achieved. Thus, the planned total duration of the trial for the OS endpoint is 42 months.
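The planned calendar times in the list above follow from simple arithmetic on the 18-month enrollment period. The sketch below reproduces them in base R; the variable names are illustrative only.

```r
# Planned enrollment duration (months)
enrollMonths <- 18

# Planned follow-up after final enrollment for each analysis (months)
followUp <- c(IA1 = 6, IA2 = 14, Final = 24)

# Planned calendar time of each analysis, from start of enrollment
plannedMonth <- enrollMonths + followUp
plannedMonth

# Minimum follow-up expressed as a multiple of the assumed control median
minFollowUpPFS <- unname(followUp["IA2"]) / 5    # control median PFS: 5 months
minFollowUpOS <- unname(followUp["Final"]) / 12  # control median OS: 12 months
c(PFS = minFollowUpPFS, OS = minFollowUpOS)
```

The planned analysis months 24, 32, and 42 are the values that appear in the design tables later in the document.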

The sample size for the trial will be driven by an adequate sample size and targeted events in the subgroup to ensure 90% power for the OS endpoint assuming a hazard ratio of 0.65. For group sequential designs, we assume 1-sided testing.

The packages needed for the remainder of the document are loaded below.

options(scipen=999)
# colorblind palette
cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
# 3 packages used for data storage and manipulation: dplyr, reshape2, tibble
library(dplyr)
library(reshape2)
library(tibble)
# 2 packages used for R Markdown capabilities: knitr, kableExtra
library(knitr)
library(kableExtra)
library(ggplot2) # for plotting
library(gsDesign) # Group sequential design capabilities
library(gMCP) # Multiplicity evaluation

Multiplicity diagram for hypothesis testing

Following is the multiplicity graph for the trial design. We have arbitrarily split Type I error equally between the subgroup and overall populations. Most \(\alpha\) is allocated to OS and the least to ORR, with PFS receiving an intermediate amount. This reflects the priority of the endpoints as well as the practicality to detect clinically significant differences in each population. Reallocation for each endpoint proceeds from the subgroup to the overall population. If the overall population hypothesis is rejected for a given endpoint, the reallocation is split between the two populations for another endpoint.

# If needed, see help file for gsDesign::hGraph() for explanation of parameters below
# Hypothesis names
nameHypotheses <- c("H1: OS\n Subgroup",
                    "H2: OS\n All subjects",
                    "H3: PFS\n Subgroup",
                    "H4: PFS\n All subjects",
                    "H5: ORR\n Subgroup",
                    "H6: ORR\n All subjects")
# Number of hypotheses to be tested
nHypotheses <- length(nameHypotheses)
# Transition weights for alpha reallocation (square matrix)
m <- matrix(c(
  0,1,0,0,0,0,
  0,0,.5,.5,0,0,
  0,0,0,1,0,0,
  0,0,0,0,.5,.5,
  0,0,0,0,0,1,
  .5,.5,0,0,0,0), nrow=6, byrow=TRUE)
# Initial Type I error assigned to each hypothesis (one-sided)
alphaHypotheses <- c(.01,.01,.002,0.002,0.0005,.0005)
# Make a ggplot representation of the above specification and display it
g <- gsDesign::hGraph(6,alphaHypotheses=alphaHypotheses,m=m,nameHypotheses=nameHypotheses,
       halfWid=1,halfHgt=.35,radius=2.5,radius2=1, offset=0, trhw=.15, 
       x=c(-1.25,1.25,-2.5,2.5,-1.25,1.25), y=c(2,2,1,1,0,0),
       trprop=0.4,fill=as.character(c(2,2,4,4,3,3)))+ scale_fill_manual(values=cbPalette)
g
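Before using a graph specification, it can help to confirm that it is internally consistent. The base R sketch below repeats the transition matrix and initial \(\alpha\) from the code above and checks a few properties that a valid specification is expected to have; the checks themselves are an illustration and are not part of the original template.

```r
# Transition matrix and initial alpha, copied from the specification above
m <- matrix(c(
  0, 1, 0, 0, 0, 0,
  0, 0, .5, .5, 0, 0,
  0, 0, 0, 1, 0, 0,
  0, 0, 0, 0, .5, .5,
  0, 0, 0, 0, 0, 1,
  .5, .5, 0, 0, 0, 0
), nrow = 6, byrow = TRUE)
alphaHypotheses <- c(.01, .01, .002, 0.002, 0.0005, .0005)

# Structural checks: square matrix, no self-loops, weights in [0, 1],
# and outgoing weights from each hypothesis summing to at most 1
stopifnot(nrow(m) == ncol(m), all(diag(m) == 0))
stopifnot(all(m >= 0 & m <= 1), all(rowSums(m) <= 1 + 1e-12))

# Total one-sided Type I error and the initial weights implied by it
totalAlpha <- sum(alphaHypotheses)
weights <- alphaHypotheses / totalAlpha

# Initial alpha by endpoint (two hypotheses per endpoint)
endpoint <- rep(c("OS", "PFS", "ORR"), each = 2)
alphaByEndpoint <- tapply(alphaHypotheses, endpoint, sum)
totalAlpha
alphaByEndpoint
```

Here every row of the matrix sums to 1, so all \(\alpha\) from a rejected hypothesis is passed on rather than lost.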

This testing scheme can result in what might be referred to as time travel for passing of \(\alpha\). That is, if PFS hypotheses are not rejected at a given analysis (say final PFS analysis) and OS hypotheses are rejected at the final analysis, then the previously evaluated PFS tests at the interim and final PFS analysis can be compared to updated bounds based on reallocated Type I error (Liu and Anderson 2008).

Group sequential designs for each hypothesis

For the example, we assume 1-sided testing; for a futility bound we would use test.type=4 or test.type=6 instead of test.type=1 as specified here. For planning purposes, spending for each analysis is based on the fraction of final planned events (information fraction) for that hypothesis. Adaptation of this at the time of analyses will be noted in the next section. Spending for all group sequential designs uses the Lan-DeMets spending function approximating an O’Brien-Fleming bound.
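To make the spending plan concrete, the sketch below writes out the usual closed form of the Lan-DeMets spending function approximating O'Brien-Fleming, \(f(t) = 2 - 2\Phi(z_{\alpha/2}/\sqrt{t})\), in base R. The helper name ldofSpend is hypothetical, and treating this formula as what sfLDOF computes with default settings is an assumption of the sketch.

```r
# Lan-DeMets O'Brien-Fleming-like spending: cumulative alpha spent at time t
ldofSpend <- function(t, alpha) {
  t <- pmin(t, 1)  # all alpha is spent once spending time reaches 1
  2 * (1 - pnorm(qnorm(1 - alpha / 2) / sqrt(t)))
}

# Planned information fractions for a 3-analysis OS design with alpha = 0.01
timing <- c(0.6, 0.82, 1)
cumSpend <- ldofSpend(timing, alpha = 0.01)

# Incremental alpha spent at each analysis
incSpend <- diff(c(0, cumSpend))
round(rbind(cumSpend, incSpend), 4)
```

Very little \(\alpha\) is spent at the first interim, which is the defining feature of an O'Brien-Fleming-like bound.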

H1: OS, Subgroup

We assume 50% of the population is in the subgroup of interest. A sample size of 378 is driven by overall survival (OS) in the subgroup where we assume a hazard ratio of 0.65.

ossub <- gsDesign::gsSurv (k = 3, test.type = 1, alpha = 0.01, beta = 0.1, hr = 0.65, 
                 timing = c(0.6,0.82), sfu  = sfLDOF, 
                 lambdaC = log(2)/12, eta = 0.001, S = NULL,
                 gamma = c(2.5,5,7.5,10), R = c(2,2,2,12),  
                 T=42, minfup = 24)
tab <- gsDesign::gsBoundSummary(ossub)
rownames(tab) <- 1:nrow(tab)
tab %>% kable(caption = "Design for OS in the subgroup.") %>% kable_styling()

Design for OS in the subgroup.
Analysis	Value	Efficacy
IA 1: 60%	Z	3.1270
N: 378	p (1-sided)	0.0009
Events: 171	~HR at bound	0.6191
Month: 24	P(Cross) if HR=1	0.0009
	P(Cross) if HR=0.65	0.3786
IA 2: 82%	Z	2.6382
N: 378	p (1-sided)	0.0042
Events: 233	~HR at bound	0.7075
Month: 32	P(Cross) if HR=1	0.0044
	P(Cross) if HR=0.65	0.7472
Final	Z	2.3822
N: 378	p (1-sided)	0.0086
Events: 284	~HR at bound	0.7535
Month: 42	P(Cross) if HR=1	0.0100
	P(Cross) if HR=0.65	0.9000
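Two entries in the table above can be approximated by hand. The base R sketch below uses the Schoenfeld approximation for the fixed-design event count and the approximation \(\exp(-2Z/\sqrt{\text{events}})\) for the hazard ratio at a bound under 1:1 randomization. Both are standard approximations used here as a cross-check; they are not claimed to be the exact computations inside gsSurv or gsBoundSummary, and the helper name hrAtBound is illustrative.

```r
# Design assumptions for H1 (OS, subgroup)
alpha <- 0.01
beta <- 0.1
hr <- 0.65

# Schoenfeld approximation: events for a fixed design with 1:1 randomization
fixedEvents <- 4 * (qnorm(1 - alpha) + qnorm(1 - beta))^2 / log(hr)^2
fixedEvents

# Approximate hazard ratio needed to reach a Z bound with a given event count
hrAtBound <- function(z, events) exp(-2 * z / sqrt(events))

# Z bounds and rounded event counts taken from the table above
z <- c(3.1270, 2.6382, 2.3822)
events <- c(171, 233, 284)
approxHR <- hrAtBound(z, events)
round(approxHR, 4)
```

The fixed-design count falls slightly below the 284 final events of the group sequential design, as expected since interim analyses inflate the required events. Small differences from the tabled hazard ratios are expected because the table rounds event counts.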

H2: OS, All

The total sample size is assumed to be twice the above, N=756. Power and calendar timing must match that above. The power is slightly above 90% for a hazard ratio of 0.74.

hr <- .74
beta <- .098
os    <- gsDesign::gsSurv (k = 3, test.type = 1, alpha = 0.01, beta = beta, hr = hr, 
                 timing = c(0.6,0.82), sfu  = sfLDOF, 
                 lambdaC = log(2)/12, eta = 0.001, S = NULL,
                 gamma = c(2.5,5,7.5,10), R = c(2,2,2,12),  
                 T=42, minfup = 24)
tab <- gsDesign::gsBoundSummary(os)
rownames(tab) <- 1:nrow(tab)
tab %>% kable(caption = "Design for OS in all subjects") %>% kable_styling()

Design for OS in all subjects
Analysis	Value	Efficacy
IA 1: 60%	Z	3.1270
N: 756	p (1-sided)	0.0009
Events: 352	~HR at bound	0.7163
Month: 24	P(Cross) if HR=1	0.0009
	P(Cross) if HR=0.74	0.3820
IA 2: 82%	Z	2.6382
N: 756	p (1-sided)	0.0042
Events: 481	~HR at bound	0.7860
Month: 32	P(Cross) if HR=1	0.0044
	P(Cross) if HR=0.74	0.7505
Final	Z	2.3822
N: 756	p (1-sided)	0.0086
Events: 586	~HR at bound	0.8213
Month: 42	P(Cross) if HR=1	0.0100
	P(Cross) if HR=0.74	0.9020

H3: PFS, Subgroup

For progression free survival (PFS) we assume a shorter median time to event of 5 months. With an assumed hazard ratio of 0.65, we adjust beta and timing to match the targeted sample size and interim analysis timing. We assume a larger dropout rate for PFS than we did for OS.

hr <- .65
beta <- .204
pfssub<- gsDesign::gsSurv (k = 2, test.type = 1, alpha = 0.002, beta = beta, hr = hr, 
                 timing = .87, sfu  = sfLDOF, 
                 lambdaC = log(2)/5, eta = 0.02, S = NULL,
                 gamma = c(2.5,5,7.5,10), R = c(2,2,2,12),  
                 T=32, minfup = 14)
tab <- gsDesign::gsBoundSummary(pfssub)
rownames(tab) <- 1:nrow(tab)
tab %>% kable(caption = "Design for PFS in the subgroup") %>% kable_styling()

Design for PFS in the subgroup
Analysis	Value	Efficacy
IA 1: 87%	Z	3.1140
N: 378	p (1-sided)	0.0009
Events: 259	~HR at bound	0.6787
Month: 24	P(Cross) if HR=1	0.0009
	P(Cross) if HR=0.65	0.6416
Final	Z	2.9245
N: 378	p (1-sided)	0.0017
Events: 297	~HR at bound	0.7121
Month: 32	P(Cross) if HR=1	0.0020
	P(Cross) if HR=0.65	0.7960

H4: PFS, All

Finally, we design for PFS in all subjects.

hr <- .74
beta <- .172
pfs   <- gsDesign::gsSurv (k = 2, test.type = 1, alpha = 0.003, beta = beta, hr = hr, 
                 timing = .86, sfu  = sfLDOF, 
                 lambdaC = log(2)/5, eta = 0.02, S = NULL,
                 gamma = c(2.5,5,7.5,10), R = c(2,2,2,12),  
                 T=32, minfup = 14)
tab <- gsDesign::gsBoundSummary(pfs)
rownames(tab) <- 1:nrow(tab)
tab %>% kable(caption = "Design for PFS in the overall population") %>% kable_styling()

Design for PFS in the overall population
Analysis	Value	Efficacy
IA 1: 86%	Z	2.9947
N: 756	p (1-sided)	0.0014
Events: 523	~HR at bound	0.7694
Month: 24	P(Cross) if HR=1	0.0014
	P(Cross) if HR=0.74	0.6745
Final	Z	2.7955
N: 756	p (1-sided)	0.0026
Events: 608	~HR at bound	0.7970
Month: 32	P(Cross) if HR=1	0.0030
	P(Cross) if HR=0.74	0.8280

H5 and H6: ORR

For objective response rate (ORR), we assume an underlying control rate of 15%. In the subgroup population, we have almost 90% power to detect a 20 percentage point improvement in ORR, from 15% to 35%.

nBinomial(p1=.35,p2=.15,alpha=.0005,n=378)

## [1] 0.8911724

In the all subjects population we have approximately 95% power to detect an improvement in ORR from 15% to 30%.

nBinomial(p1=.3,p2=.15,alpha=.0005,n=756)

## [1] 0.9530369
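The two power values above can be cross-checked with a textbook normal approximation for comparing two proportions with equal allocation, using the pooled variance under the null hypothesis. The function orrPower below is a hypothetical helper written for this sketch; that nBinomial uses an equivalent formula with its default settings is an assumption.

```r
# Normal-approximation power for a one-sided test of two proportions,
# equal allocation, total sample size n
orrPower <- function(p1, p2, n, alpha) {
  nArm <- n / 2
  pBar <- (p1 + p2) / 2
  sd0 <- sqrt(2 * pBar * (1 - pBar))            # per-subject SD under H0
  sd1 <- sqrt(p1 * (1 - p1) + p2 * (1 - p2))    # per-subject SD under H1
  pnorm((abs(p1 - p2) * sqrt(nArm) - qnorm(1 - alpha) * sd0) / sd1)
}

powerSubgroup <- orrPower(p1 = .35, p2 = .15, n = 378, alpha = .0005)
powerAll <- orrPower(p1 = .3, p2 = .15, n = 756, alpha = .0005)
round(c(Subgroup = powerSubgroup, All = powerAll), 3)
```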

Design list

Now we associate designs with hypotheses in an ordered list corresponding to the order in the multiplicity graph setup. Since ORR designs are not group sequential, we enter NULL values for those in the last 2 entries of the design list.

gsDlist <- list(ossub,os,pfssub,pfs,NULL,NULL)

Spending plan and spending time

While it was relatively straightforward above to set up timing of analyses to match for the different hypotheses, accumulation of endpoints can vary from plan in a variety of ways. Planning on how to deal with this is critical at the time of protocol development to avoid later amendments or inappropriate \(\alpha\)-allocation to early analyses. Before going into examples, we review the concept of \(\alpha\)-spending and what we will refer to as spending time.

For a given hypothesis, we will assign a non-decreasing spending function \(f(t)\) defined for \(t\ge 0\) with \(f(0)=0\) and \(f(t)=\alpha\) for \(t\ge 1\). We will assume \(K\) analyses with observed event counts \(n_k\) at analysis \(k=1,2,\ldots,K\) and a targeted final event count of \(N_K\). The \(\alpha\)-spending at analysis \(k\) was originally defined (K. K. G. Lan and DeMets 1983) as \(f(t_k)\) with \(t_k=n_k/N_K\). The values \(n_k/N_K\), \(k=1,\ldots,K\), will be referred to as the information fraction. This is used to pre-specify the cumulative amount of Type I error for a hypothesis at each analysis. K. K. G. Lan and DeMets (1989) noted that calendar time was another option for the \(t_k\) values, \(k=1,\ldots,K\). Proschan, Lan, and Wittes (2006) noted further that as long as \(t_k\) is increasing with \(k\), it can be used to define spending; this is subject to the requirement that, under the null hypothesis, the timing must be selected in a way that is not correlated with the test statistic (e.g., blinded). We will refer to \(t_k\), regardless of its definition, as the spending time for a hypothesis. Note that the joint distribution of interim and final tests for a hypothesis is driven by \(n_k\), \(k=1,\ldots,K\). This is equivalent to basing correlation on the information fraction \(n_k/N_K\), \(1\le k\le K\), computed from the actual event counts \(n_k\) and the planned final count \(N_K\). Thus, both spending time and information fraction are required to compute bounds for group sequential testing. Our general objectives here will be to:

Spend all Type I error for each hypothesis in its combined interim and final analyses; this requires the spending time to be 1 for the final analysis of a hypothesis.
Ensure spending time is well defined for each analysis of each hypothesis.
We will assume that both follow-up duration and event counts may be of interest in determining timing of analyses; e.g., for immuno-oncology therapies there have been delayed treatment effects and the tail of the time-to-event distribution has been important to establish benefit. Thus, we will assume here that over-spending at interim analysis is to be avoided.

Here we assume that the subgroup prevalence was over-estimated in the study design and indicate how spending time can be used to deal with this deviation from plan.

Results entry at time of analysis

Results for each analysis performed should be entered here. We begin by documenting timing and event counts of each analysis. Then we proceed to enter nominal 1-sided testing p-values for each analysis of each hypothesis.

Timing of analyses and resulting event counts and spending times

Recall that the design assumed 50% prevalence of the subgroup. Here we assume that the observed prevalence is 40% and that, by specification stated above, we enroll until the targeted subpopulation of 378 is achieved. This is assumed to occur after 22 months with a total enrollment of 940. Timing of analyses is now targeted as follows:

The first interim is scheduled at 28 months, 6 months after final enrollment.
The second interim is scheduled at the later of 14 months after final enrollment (22 + 14 = 36 months after start of enrollment) or the targeted final PFS event count of 297 events. We assume the event count is reached at 34 months and that the achieved final event count is 310 in the subgroup at 36 months.
The final analysis is scheduled at the later of 24 months after final enrollment (month 22 + 24 = 46) or when 284 events have been observed in the subgroup, with the qualification that the final analysis will be no more than 30 months after final enrollment (6 months after the targeted time). We assume the targeted event count is not reached by 6 months after the targeted final analysis time; thus, the final analysis cutoff is set at month 22 + 30 = 52, and we assume that at that time 270 OS events have been observed in the subgroup.
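The cutoff rules above, read together with the design section, amount to waiting for both a calendar target and an event target, subject to a maximum delay. The sketch below encodes that reading in base R; the function analysisMonth and its arguments are hypothetical names introduced for illustration, and Inf stands in for an event target that is never reached.

```r
# Cutoff month: later of the calendar target and the month the event target
# is reached, but no later than the calendar target plus a maximum delay
analysisMonth <- function(targetMonth, eventMonth, maxDelay) {
  min(max(targetMonth, eventMonth), targetMonth + maxDelay)
}

enrollEnd <- 22  # months until final enrollment in this scenario

ia1 <- enrollEnd + 6
ia2 <- analysisMonth(enrollEnd + 14, eventMonth = 34, maxDelay = 3)
final <- analysisMonth(enrollEnd + 24, eventMonth = Inf, maxDelay = 6)
c(IA1 = ia1, IA2 = ia2, Final = final)

# Enrollment implied by 378 subgroup subjects at 40% prevalence
378 / 0.40
```

The implied enrollment is close to the total of 940 assumed in the text.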

All of the above leads to event counts and spending for PFS and OS as follows:

# PFS, overall population
pfs$n.I <- c(675,750)
# PFS, subgroup
pfssub$n.I <- c(265,310)
# OS, overall population
os$n.I <- c(529,700,800)
# OS, subgroup
ossub$n.I <- c(185,245,295)

Nominal p-values for each analysis

For analyses not yet performed enter dummy values, including a p-value near 1 (e.g., .99). No other entry is required by the user in any other section of the document. A spending time is also entered for each analysis of the OS and PFS hypotheses. In the code below, the spending time for both populations is the fraction of final subgroup events observed for the corresponding endpoint, so H1 and H2 share one set of spending times and H3 and H4 share another. ORR is tested only once, so its event count and spending time are entered as NA.

inputResults <- tibble(H=c(rep(1,3),rep(2,3),rep(3,2),rep(4,2),5,6),
                 Pop=c(rep("Subgroup",3),rep("All",3),
                       rep("Subgroup",2),rep("All",2),
                       "Subgroup","All"),
                 Endpoint=c(rep("OS",6),rep("PFS",4),rep("ORR",2)),
                 nominalP=c(.03,.0001,.000001,
                            .2,.15,.1,
                            .2,.001,
                            .3,.2,
                            .00001,
                            .1),
                 Analysis=c(1:3,1:3,1:2,1:2,1,1),
                 events=c(ossub$n.I,os$n.I,pfssub$n.I,pfs$n.I,NA,NA),
                 spendingTime=c(ossub$n.I/max(ossub$n.I),
                                ossub$n.I/max(ossub$n.I),
                                pfssub$n.I/max(pfssub$n.I),
                                pfssub$n.I/max(pfssub$n.I),
                                NA,NA))
kable(inputResults,caption="DUMMY RESULTS FOR IA2.") %>%
   kable_styling() %>%
   add_footnote("Dummy results", notation="none")

DUMMY RESULTS FOR IA2.
H	Pop	Endpoint	nominalP	Analysis	events	spendingTime
1	Subgroup	OS	0.030000	1	185	0.6271186
1	Subgroup	OS	0.000100	2	245	0.8305085
1	Subgroup	OS	0.000001	3	295	1.0000000
2	All	OS	0.200000	1	529	0.6271186
2	All	OS	0.150000	2	700	0.8305085
2	All	OS	0.100000	3	800	1.0000000
3	Subgroup	PFS	0.200000	1	265	0.8548387
3	Subgroup	PFS	0.001000	2	310	1.0000000
4	All	PFS	0.300000	1	675	0.8548387
4	All	PFS	0.200000	2	750	1.0000000
5	Subgroup	ORR	0.000010	1	NA	NA
6	All	ORR	0.100000	1	NA	NA
Dummy results
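The table shows why spending time is kept separate from information fraction. For OS in all subjects (H2), the observed event counts exceed the planned final count of 586 well before the final analysis, so spending by information fraction would use up all \(\alpha\) at the second interim. The base R sketch below contrasts the two choices, assuming the closed-form Lan-DeMets O'Brien-Fleming-like spending function; ldofSpend is a hypothetical helper name.

```r
# Cumulative alpha spent under Lan-DeMets O'Brien-Fleming-like spending
ldofSpend <- function(t, alpha) {
  2 * (1 - pnorm(qnorm(1 - alpha / 2) / sqrt(pmin(t, 1))))
}

eventsAll <- c(529, 700, 800)   # observed OS events, all subjects
plannedFinalAll <- 586          # planned final OS events, all subjects
eventsSub <- c(185, 245, 295)   # observed OS events, subgroup

# Information fraction relative to plan versus spending time from the table
infoFrac <- eventsAll / plannedFinalAll
spendTime <- eventsSub / max(eventsSub)

# Cumulative alpha (of 0.01) spent at each analysis under each choice
spendInfo <- ldofSpend(infoFrac, alpha = 0.01)
spendST <- ldofSpend(spendTime, alpha = 0.01)
round(data.frame(infoFrac, spendTime, spendInfo, spendST), 4)
```

With spending time, some \(\alpha\) is held back for the final analysis, in line with the objective of avoiding over-spending at interim analyses.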

Testing hypotheses

Compute sequential p-values for each hypothesis

Sequential p-value computation is done in one loop in an attempt to minimize chances for coding errors. We delay showing the results until after the sequence of multiplicity graphs generated by hypothesis rejection has been displayed.

# One row per hypothesis; seqp is a placeholder filled in by the loop below
EOCtab <- inputResults %>%
  group_by(H) %>%
  slice(1) %>%
  ungroup() %>%
  select("H", "Pop", "Endpoint", "nominalP")
EOCtab$seqp <- .9999
for (EOCtabline in 1:nHypotheses) {
  if (is.null(gsDlist[[EOCtabline]])) {
    # No group sequential design (ORR): use the nominal p-value directly
    EOCtab$seqp[EOCtabline] <- EOCtab$nominalP[EOCtabline]
  } else {
    # Group sequential design: combine all analyses for this hypothesis
    tem <- filter(inputResults, H == EOCtabline)
    EOCtab$seqp[EOCtabline] <- sequentialPValue(
      gsD = gsDlist[[EOCtabline]], interval = c(.0001, .9999),
      n.I = tem$events,
      Z = -qnorm(tem$nominalP),
      usTime = tem$spendingTime
    )
  }
}
EOCtab <- EOCtab %>% select(-"nominalP")
# kable(EOCtab,caption="Sequential p-values as initially placed in EOCtab") %>% kable_styling()
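A sequential p-value is the smallest \(\alpha\) at which a hypothesis would be rejected at some analysis already performed. With only the first analysis available and a Lan-DeMets O'Brien-Fleming-like spending function, this reduces to solving \(f_\alpha(t_1) = p_1\) for \(\alpha\), which has a closed form. The base R sketch below illustrates this special case only; the helper names are hypothetical, and later analyses require the joint distribution of the test statistics, which is what sequentialPValue handles.

```r
# Cumulative alpha spent at spending time t for total one-sided alpha
ldofSpend <- function(t, alpha) {
  2 * (1 - pnorm(qnorm(1 - alpha / 2) / sqrt(pmin(t, 1))))
}

# Smallest alpha for which the first-analysis bound is met by nominal p1:
# invert ldofSpend(t1, alpha) = p1 in alpha
seqPFirst <- function(p1, t1) {
  2 * (1 - pnorm(sqrt(pmin(t1, 1)) * qnorm(1 - p1 / 2)))
}

# H1 at the first interim: nominal p = 0.03 at spending time 185/295
p1 <- 0.03
t1 <- 185 / 295
alphaNeeded <- seqPFirst(p1, t1)
alphaNeeded

# Round trip: spending alphaNeeded at t1 gives back the nominal p-value
ldofSpend(t1, alphaNeeded)
```

Because so little \(\alpha\) is spent early, the first-analysis value is much larger than the nominal p-value; it is an upper limit on the sequential p-value once later analyses are included.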

Evaluate hypothesis rejection using gMCP

We need to set up a graph object as implemented in the gMCP package.

# make a graph object
rownames(m) <- nameHypotheses
graph <- matrix2graph(m)
# add weights to the object based on alpha allocation
graph <- setWeights(graph,alphaHypotheses/.025)
rescale <- 45
d <- g$layers[[2]]$data
rownames(d) <- rownames(m)
# graph@nodeAttr$X <- rescale * d$x * 1.75
# graph@nodeAttr$Y <- -rescale * d$y * 2

Now we add the sequential p-values and evaluate which hypotheses have been rejected.

result <- gMCP(graph=graph, pvalues=EOCtab$seqp, alpha=.025)
result@rejected

##      H1: OS\n Subgroup  H2: OS\n All subjects     H3: PFS\n Subgroup 
##                   TRUE                  FALSE                   TRUE 
## H4: PFS\n All subjects     H5: ORR\n Subgroup H6: ORR\n All subjects 
##                  FALSE                   TRUE                  FALSE
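As an independent check on the rejections above, the sequentially rejective graphical procedure of Bretz, Maurer, and Posch (2009) can be written in a few lines of base R. The function graphTest below is a hypothetical sketch, not the gMCP implementation; it uses the sequential p-values reported in the results table of this document and may reject hypotheses in a different order than gMCP, although the final set of rejections does not depend on the order.

```r
graphTest <- function(p, alpha, G) {
  active <- rep(TRUE, length(p))
  repeat {
    cand <- which(active & p <= alpha)
    if (length(cand) == 0) break
    j <- cand[1]                      # reject one hypothesis at a time
    active[j] <- FALSE
    idx <- which(active)
    # Pass alpha from the rejected hypothesis along its outgoing edges
    alpha[idx] <- alpha[idx] + alpha[j] * G[j, idx]
    alpha[j] <- 0
    # Update transition weights among the remaining hypotheses
    Gnew <- matrix(0, nrow(G), ncol(G))
    for (l in idx) for (k in idx) {
      denom <- 1 - G[l, j] * G[j, l]
      if (l != k && denom > 0) Gnew[l, k] <- (G[l, k] + G[l, j] * G[j, k]) / denom
    }
    G <- Gnew
  }
  list(rejected = !active, alpha = alpha)
}

m <- matrix(c(
  0, 1, 0, 0, 0, 0,
  0, 0, .5, .5, 0, 0,
  0, 0, 0, 1, 0, 0,
  0, 0, 0, 0, .5, .5,
  0, 0, 0, 0, 0, 1,
  .5, .5, 0, 0, 0, 0
), nrow = 6, byrow = TRUE)
alphaHypotheses <- c(.01, .01, .002, .002, .0005, .0005)
seqP <- c(0.0001, 0.1232177, 0.0011310, 0.2355583, 0.00001, 0.1)

res <- graphTest(seqP, alphaHypotheses, m)
res$rejected
res$alpha
```

The remaining \(\alpha\) for the hypotheses not rejected is what is available to them in the final graph.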

# Map results back into EOCtab (relies on EOCtab rows being in hypothesis order)
EOCtab$Rejected <- result@rejected
EOCtab$adjPValues <- result@adjPValues

Verification of hypotheses rejected

# number of graphs is used repeatedly
ngraphs <- length(result@graphs)
# Set up tibble with hypotheses rejected at each stage
rejected <- NULL
for(i in 1:length(result@graphs)){
  rejected <- rbind(rejected,
                    tibble(H=1:nHypotheses,Stage=i,
                           Rejected=as.logical(result@graphs[[i]]@nodeAttr$rejected)))
}
rejected <- rejected %>%
  filter(Rejected) %>%
  group_by(H) %>%
  summarize(graphRejecting=min(Stage)-1) %>% # last graph with weight>0 where H rejected
  arrange(graphRejecting)
# get final weights
# for hypotheses not rejected, this will be final weight where
# no hypothesis could be rejected
lastWeights <- as.numeric(result@graphs[[ngraphs]]@weights)
lastGraph <- rep(ngraphs,nrow(EOCtab))
# we will update for rejected hypotheses with last positive weight for each
if (ngraphs > 1)for(i in 1:(ngraphs-1)){
  lastWeights[rejected$H[i]] <- as.numeric(result@graphs[[i]]@weights[rejected$H[i]])
  lastGraph[rejected$H[i]] <- i 
}
EOCtab$lastAlpha <- .025 * lastWeights
EOCtab$lastGraph <- lastGraph
EOCtabx <- EOCtab
names(EOCtabx) <- c("Hypothesis","Population","Endpoint","Sequential p",
                   "Rejected","Adjusted p","Max alpha allocated","Last Graph")
# display table with desired column order
# delayed following until after multiplicity graph sequence 
#EOCtabx %>% select(c(1:4,7,5:6,8)) %>% kable() %>% kable_styling()

Multiplicity graph sequence from gMCP

[Figures: Graph 1 through Graph 4, the sequence of updated multiplicity graphs produced by gMCP.]

Comparison of sequential p-values to multiplicity graphs

We can compare sequential p-values to the available \(\alpha\) in each graph. The columns ‘Max alpha allocated’ and ‘Last Graph’ show one of 2 things:

For rejected hypotheses, the maximum \(\alpha\) allocated to the hypothesis and the graph in which it was rejected. For example, hypothesis 1 was allocated \(\alpha=0.01\) in the first graph above. The sequential p-value of 0.0001 is smaller than \(\alpha=0.01\) and thus the hypothesis is rejected. We can then proceed to the second graph and see that hypothesis 5 was rejected. The last hypothesis rejected is hypothesis 3 in the third graph.
For the remaining hypotheses (H2, H4, H6), the maximum \(\alpha\) allocated is in the fourth graph; since each sequential p-value is greater than the allocated \(\alpha\) for the corresponding hypothesis, none of these hypotheses were rejected.

EOCtabx %>% select(c(1:4,7,5:6,8)) %>% kable() %>% kable_styling()

Hypothesis	Population	Endpoint	Sequential p	Max alpha allocated	Rejected	Adjusted p	Last Graph
1	Subgroup	OS	0.0001000	0.0100	TRUE	0.0002500	1
2	All	OS	0.1232177	0.0200	FALSE	0.1540221	4
3	Subgroup	PFS	0.0011310	0.0020	TRUE	0.0141370	3
4	All	PFS	0.2355583	0.0040	FALSE	0.2453732	4
5	Subgroup	ORR	0.0000100	0.0005	TRUE	0.0005000	2
6	All	ORR	0.1000000	0.0010	FALSE	0.2453732	4
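The comparison described above can also be done mechanically. The base R sketch below copies three columns from the table and confirms that both rules, sequential p-value versus maximum \(\alpha\) allocated and adjusted p-value versus the overall 0.025, give the same rejections; the vector names are illustrative.

```r
# Columns copied from the table above, in hypothesis order H1 to H6
seqP <- c(0.0001, 0.1232177, 0.0011310, 0.2355583, 0.00001, 0.1)
maxAlpha <- c(0.01, 0.02, 0.002, 0.004, 0.0005, 0.001)
adjP <- c(0.00025, 0.1540221, 0.0141370, 0.2453732, 0.0005, 0.2453732)

# Rule 1: reject when the sequential p-value is within the allocated alpha
rejectedBySeqP <- seqP <= maxAlpha

# Rule 2: reject when the adjusted p-value is within the overall alpha
rejectedByAdjP <- adjP <= 0.025

data.frame(H = 1:6, rejectedBySeqP, rejectedByAdjP)
```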

Bounds at final \(\alpha\) allocated for group sequential tests

As a separate validation, we examine group sequential bounds for each hypothesis updated with 1) the maximum \(\alpha\) allocated above, 2) the number of events at each analysis, and 3) the cumulative spending at each analysis above. The nominal p-value for at least one of the analyses performed for each rejected hypothesis should be less than or equal to the nominal p-value bound in the group sequential design. For each hypothesis not rejected, all nominal p-values are greater than the corresponding bounds. For hypotheses tested without a group sequential design, the nominal p-value for the test of that hypothesis can be compared to the maximum alpha allocated in the above table.
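The first-analysis bound in each of the tables that follow can be approximated by hand, because at the first analysis the nominal p-value bound is just the cumulative \(\alpha\) spent at that spending time. The base R sketch below assumes the closed-form Lan-DeMets O'Brien-Fleming-like spending function; ldofSpend is a hypothetical helper, and later bounds are not reproduced here since they depend on the correlation between analyses.

```r
# Cumulative alpha spent at spending time t for total one-sided alpha
ldofSpend <- function(t, alpha) {
  2 * (1 - pnorm(qnorm(1 - alpha / 2) / sqrt(pmin(t, 1))))
}

# Maximum alpha allocated to each group sequential hypothesis
maxAlpha <- c(H1 = 0.01, H2 = 0.02, H3 = 0.002, H4 = 0.004)

# Spending time at the first analysis: subgroup event fraction by endpoint
spendTime1 <- c(185 / 295, 185 / 295, 265 / 310, 265 / 310)

# Nominal p-value bound and Z bound at the first analysis
p1Bound <- ldofSpend(spendTime1, maxAlpha)
z1Bound <- qnorm(1 - p1Bound)
round(rbind(p1Bound, z1Bound), 4)
```

These values can be compared with the IA 1 rows for Hypotheses 1 through 4 below.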

Hypothesis 1

Nominal p-values at each analysis for comparison to bounds in table below:

0.03 0.0001 0.000001

Max alpha allocated from above table: 0.01
Analysis	Value	Efficacy
IA 1: 65%	Z	3.0503
Events: 185	p (1-sided)	0.0011
IA 2: 86%	Z	2.6238
Events: 245	p (1-sided)	0.0043
Final	Z	2.3861
Events: 295	p (1-sided)	0.0085

Hypothesis 2

Nominal p-values at each analysis for comparison to bounds in table below:

0.2 0.15 0.1

Max alpha allocated from above table: 0.02
Analysis	Value	Efficacy
IA 1: 90%	Z	2.7157
Events: 529	p (1-sided)	0.0033
IA 2: 120%	Z	2.3386
Events: 700	p (1-sided)	0.0097
Final	Z	2.1098
Events: 800	p (1-sided)	0.0174

Hypothesis 3

Nominal p-values at each analysis for comparison to bounds in table below:

0.2 0.001

Max alpha allocated from above table: 0.002
Analysis	Value	Efficacy
IA 1: 89%	Z	3.1449
Events: 265	p (1-sided)	0.0008
Final	Z	2.9201
Events: 310	p (1-sided)	0.0017

Hypothesis 4

Nominal p-values at each analysis for comparison to bounds in table below:

0.3 0.2

Max alpha allocated from above table: 0.004
Analysis	Value	Efficacy
IA 1: 111%	Z	2.9023
Events: 675	p (1-sided)	0.0019
Final	Z	2.6840
Events: 750	p (1-sided)	0.0036

Hypothesis 5

Maximum alpha allocated: 0.0005

Nominal p-value for hypothesis test: 0.00001

Hypothesis 6

Maximum alpha allocated: 0.001

Nominal p-value for hypothesis test: 0.1

Session information

The following documents the versions of R and packages used to develop this document. Note, in particular, that version 3.1 or later of the gsDesign package is needed.

sessionInfo()

## R version 3.5.0 (2018-04-23)
## Platform: i386-w64-mingw32/i386 (32-bit)
## Running under: Windows 10 x64 (build 17763)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] gMCP_0.8-14         gsDesign_3.1.0.9002 xtable_1.8-4       
## [4] ggplot2_3.2.1       kableExtra_1.1.0    knitr_1.25         
## [7] tibble_2.1.3        reshape2_1.4.3      dplyr_0.8.3        
## 
## loaded via a namespace (and not attached):
##  [1] rstudioapi_0.10      magrittr_1.5         TH.data_1.0-10      
##  [4] gtable_0.3.0         rmarkdown_1.16       vctrs_0.2.0         
##  [7] hms_0.5.2            xml2_1.2.2           webshot_0.5.1       
## [10] pillar_1.4.2         htmltools_0.4.0      stringr_1.4.0       
## [13] splines_3.5.0        CommonJavaJars_1.0-6 rJava_0.9-11        
## [16] lattice_0.20-35      survival_2.41-3      tidyselect_0.2.5    
## [19] plyr_1.8.4           sandwich_2.5-1       zoo_1.8-6           
## [22] pkgconfig_2.0.3      Matrix_1.2-14        R6_2.4.0            
## [25] digest_0.6.22        xfun_0.10            stats4_3.5.0        
## [28] colorspace_1.4-1     stringi_1.4.3        yaml_2.2.0          
## [31] lazyeval_0.2.2       codetools_0.2-15     xlsxjars_0.6.1      
## [34] evaluate_0.14        labeling_0.3         httr_1.4.1          
## [37] compiler_3.5.0       withr_2.1.2          JavaGD_0.6-1.1      
## [40] backports_1.1.5      munsell_0.5.0        Rcpp_1.0.2          
## [43] highr_0.8            zeallot_0.1.0        MASS_7.3-49         
## [46] assertthat_0.2.1     readr_1.3.1          PolynomF_2.0-2      
## [49] tools_3.5.0          mvtnorm_1.0-11       viridisLite_0.3.0   
## [52] scales_1.0.0         crayon_1.3.4         glue_1.3.1          
## [55] purrr_0.3.3          rlang_0.4.1          multcomp_1.4-10     
## [58] rvest_0.3.4          grid_3.5.0

References

Bretz, Frank, Willi Maurer, and Martin Posch. 2009. “A Graphical Approach to Sequentially Rejective Multiple Test Procedures.” Statistics in Medicine 28: 586–604. doi:10.1002/sim.3495.

Burtness, Barbara, Kevin J Harrington, Richard Greil, Denis Soulières, Makoto Tahara, Gilberto de Castro Jr, Amanda Psyrri, et al. 2019. “Pembrolizumab Alone or with Chemotherapy Versus Cetuximab with Chemotherapy for Recurrent or Metastatic Squamous Cell Carcinoma of the Head and Neck (Keynote-048): A Randomised, Open-Label, Phase 3 Study.” The Lancet 394 (10212). Elsevier: 1915–28.

Lan, K. K. G., and David L. DeMets. 1983. “Discrete Sequential Boundaries for Clinical Trials.” Biometrika 70: 659–63.

———. 1989. “Group Sequential Procedures: Calendar Versus Information Time.” Statistics in Medicine 8: 1191–8. doi:10.1002/sim.4780081003.

Liu, Qing, and Keaven M. Anderson. 2008. “On Adaptive Extensions of Group Sequential Trials for Clinical Investigations.” Journal of the American Statistical Association 103: 1621–30. doi:10.1198/016214508000000986.

Maurer, Willi, and Frank Bretz. 2013. “Multiple Testing in Group Sequential Trials Using Graphical Approaches.” Statistics in Biopharmaceutical Research 5: 311–20. doi:10.1080/19466315.2013.807748.

Proschan, Michael A., K. K. Gordon Lan, and Janet Turk Wittes. 2006. Statistical Monitoring of Clinical Trials: A Unified Approach. New York, NY: Springer.
