Regulatory Aspects of Quality of Life
Clare Gnecco and Peter A. Lachenbruch
Abstract: We discuss some issues in using quality of life endpoints in studies that will lead to an application to license a product or add an indication to an already licensed product. Studies of this sort should be double blinded, randomized and use validated questionnaires. The duration of the study should be appropriate for the indication. Missing values can be a serious problem and plans for handling them should be included. Sensitivity analyses are important in this context. An example of a sensitivity analysis shows how substitutions for missing values can offer insight into the effect of these missing values. An alternative model is given to analyze data of this sort. Other analytic methods are discussed in the second part of the paper.
The views and opinions expressed in this paper are those of the authors and do not necessarily represent those of the Food and Drug Administration.
Introduction:
Health Related Quality of Life (HRQOL) studies can provide supporting evidence to a regulatory agency for a labeling indication for a pharmaceutical product. The product may be a synthesized drug or a biological therapeutic product. The concerns largely revolve around experimental design, instrument design, and analysis.
Health Related Quality of Life is an ill-defined term. The WHO defines health as "a state of complete physical, mental and social well-being, and not merely the absence of disease." Thus, Health Related Quality of Life will measure these three (or four) components. HRQOL may be broadly measured as a response to a validated questionnaire (or questionnaires). Validation of the questionnaire is a complex issue that is discussed in other contributions to this conference and in many other sources in the psychometric literature. The questionnaires may be related to general health or a specific health issue. Some examples of questionnaires are
ADL - Activities of Daily Living general health questionnaire
ESSI - a seven item questionnaire on social support
EORTC QLQ-C30 - a 30 item questionnaire for cancer clinical trials
FACT-An - a 13 item subscale of the FACT (for cancer trials) that addresses issues related to anemia.
HAQ - Health Assessment Questionnaire - used in arthritis studies
PAR-Q - a questionnaire relating to readiness for physical activities
SAQ - the Seattle Angina Questionnaire (20 questions)
SDS - a symptom distress scale (18 questions)
SF-36 - the RAND Corporation 36 item questionnaire on general health
Many dimensions of quality of life can be measured. These may include general health, physical functioning (can the subject accomplish basic activities?), social function (support networks), and mental health (depression, anxiety, etc.). The questionnaires should be validated within the context in which they will be used. Thus, a scale that has been shown 'valid' for patients with cardiovascular disease may not be appropriate for cancer patients. The investigators may not know this unless an attempt is made to study the questionnaire in the oncology setting. A series of anecdotes cannot serve as a measure of QOL, although such a series can serve as the basis for selecting or developing an instrument.
Issues:
Design
Since responses to questionnaires are usually subjective, there is a substantial potential for a biased response if the subject is aware of what treatment he has received. This potential for a placebo effect makes it important to conduct the study in a double-blinded randomized manner. There is often the possibility of a side effect or laboratory value that can unblind the investigator or patient. In addition, cancer trials can have study arms using chemotherapy cycles of unequal lengths that could unmask the patient. If unblinding is a concern, one can have the questionnaire administered by a study assistant who remains masked to the treatment. Alternatively, it is sometimes possible for the questionnaire to be self-administered or computer-administered. This can remove the effect of possible investigator cues to the subject, but cannot remove bias due to known and predictable side effects that unblind the patient. Comparator groups are important in this context. If the standard of care has similar side effects, this can reduce the potential for bias. It should be noted that the informed consent forms should indicate what side effects are known, so that patients can be quite sophisticated in determining the treatment they are receiving.
Interpretation:
The regulatory agency will want to understand the meaning of the HRQOL response and interpret it in the context of the disease. For this reason, any HRQOL instrument should be described and submitted to the regulatory agency. The scoring system needs to be understood by the regulatory group. Submitting the instrument's scoring manual can be helpful to the regulatory agency. References from peer reviewed publications can discuss the validation of the instrument. If pertinent, additional studies in a relevant patient population may be submitted so that a meaningful assessment of the instrument's performance in that group can be done.
Reliability and Validity
Reliability and validation issues continually arise in HRQOL studies. It is well known that the validity of an instrument depends on the method of administration and on the culture, language, and disease (among other characteristics) of the population. A study described by Dr. Wittes (in this conference) found that performance on a task of listing all words beginning with a given letter of the alphabet (a measure of neurologic function) depended on the subject's native language. Speakers of a language that is not written in the Roman alphabet (e.g., Chinese) would respond differently to such a question. Similarly, questions relating to social support in the elderly might be answered differently in a culture that had universal health care than in one that did not. If a measure has not been used in a disease population before, it will be important to establish the measure's reliability and validity before using it in a phase III study. Thus, a depression questionnaire that works in a normal population may not work in a population with rheumatoid arthritis (RA). One study (Berkanovic and Hurwicz, 1990; Hurwicz and Berkanovic, 1993) found that twice as many RA patients as normal subjects scored in the depressed range. Is this evidence of clinical depression in these patients, or a realistic assessment of their life condition? If one of the major questions of interest is to show that a new treatment has a positive effect on depression, then the depressed state should be an eligibility criterion. In addition, a comparator group becomes essential since the depression scale may be elevated in the population. The outcome of the study may be to show a change in the depression scale, or it may be to show that a smaller proportion of the treated group than of the comparator group scores as depressed after therapy. These may be quite distinct outcomes. A scale that gives results about general health may not encompass responses related to social support or mental health.
Many HRQOL scales are proprietary and their scoring algorithms are not available to the public. This may be problematic if the agency wishes to examine them. The FDA must maintain confidentiality of all submissions, so if the proprietor of the HRQOL scale wishes to submit the scoring algorithm directly to the agency, it is possible to preserve the proprietary nature of the scale.
A HRQOL instrument must be shown valid in the population being studied. This will usually mean that the developer of the scale will have validated the instrument in a study for this condition prior to the pivotal trial. One cannot validate the instrument in the same clinical trial that is being submitted for approval. The pitfalls of such a strategy include false positive error inflation, lack of replicability, and lack of extensibility to a more heterogeneous patient group where the treatment would be applied.
We take as the definition of validity that the HRQOL scale measures what it is purported to measure. Various measures of validity may be considered. First, if a widely agreed upon measure of HRQOL exists (but is difficult or impossible to use in clinical trials), a HRQOL measure can be compared to this gold standard. An example might be a long form of a widely used scale that takes 45 minutes to complete, for which a user wants to develop a short form that takes 5 minutes. Comparing these scales would assess criterion validity. Second, the HRQOL measure may need to reflect a range of responses in the disease. Thus, in rheumatoid arthritis, a measure that looked at only the length of time to walk 25 feet might not reflect the level of joint pain, or a global evaluation by the physician. This is known as content validity: the measure captures the range of the disease's manifestations. Third, the clinicians in a field usually have some ideas about how subjects with the disease should respond to the questions. The degree to which this happens is called construct validity.
Scientists also wish the scale to be responsive to changes in the disease. These changes are responses over time and may be due to evolution of the disease, a response to treatment, or familiarity with the instrument. Several approaches have been proposed to measure this.
Let $D$ be the difference of population means, $s$ be the standard deviation of the measure at baseline, $s_D$ be the standard deviation of the change scores, and $s_S$ be the standard deviation of change among 'stable' subjects (those subjects in the control group who show little difference between baseline and final values). Then, the effect size is defined as $D/s$, the difference of population means divided by the baseline standard deviation. This is often used to estimate sample size requirements. The standardized response mean is $D/s_D$, the difference in means divided by the standard deviation of the change scores. Finally, the responsiveness statistic is $D/s_S$, the mean difference divided by the standard deviation of change among stable subjects. There is an extensive focus on these topics in the psychometric literature. In the regulatory context, it is usually necessary to define a clinically important difference (what would affect the patient's life) and a minimally important difference (MID). An effect size is convenient in estimating a sample size, but may have limited relevance to a clinically important difference. Additionally, since repeated measures analyses are often applied that involve assumptions about the covariance matrix, the sample size estimation is rarely as simple as a t-test computation for a specific effect size.
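The three responsiveness indices can be illustrated with a short computation. This is a minimal sketch, assuming the conventional definitions in which each index divides the mean change by a standard deviation (of baseline scores, of change scores, and of change among stable subjects, respectively). The function name and all scores are hypothetical and invented purely for illustration.

```python
from statistics import mean, stdev


def responsiveness_indices(baseline, final, stable_changes):
    """Return (effect size, standardized response mean, responsiveness statistic)."""
    changes = [f - b for b, f in zip(baseline, final)]
    d = mean(changes)                            # mean change from baseline
    effect_size = d / stdev(baseline)            # scaled by the baseline SD
    srm = d / stdev(changes)                     # scaled by the SD of the change scores
    responsiveness = d / stdev(stable_changes)   # scaled by the SD of change in stable subjects
    return effect_size, srm, responsiveness


# Hypothetical scores for five treated subjects and five 'stable' control subjects
baseline = [50, 55, 60, 65, 70]
final = [56, 58, 68, 69, 80]
stable_changes = [-2, 1, 0, 2, -1]

es, srm, rs = responsiveness_indices(baseline, final, stable_changes)
print(round(es, 2), round(srm, 2), round(rs, 2))
```

The three indices share a numerator and differ only in the yardstick used for variability, which is why they can give quite different impressions of the same change.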
The HRQOL measure must also be reliable. Reliability refers to the ability to measure the HRQOL trait repeatedly and consistently. Test-retest reliability implies that if the test is repeated the outcome will be similar. There are pitfalls here, of course. The time between the two tests must be sufficiently long that question recall doesn't affect the estimate of the reliability. Similarly, the time difference can't be too long since there might be differences due to disease progression or cure. These considerations hold whether the questionnaire is self-administered or given by an interviewer. A second reliability question relates to internal consistency of the questions. This is a function of the number of items and their correlation. Cronbach's alpha is an intraclass correlation that is widely used in this context. A third reliability measure is inter-rater reliability, the ability of two raters to agree on the HRQOL score. This may not be appropriate for patient reported outcomes since there cannot be multiple raters. See Staquet, et al (1998) for more details in this area.
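As an illustration of how internal consistency depends on the number of items and their correlation, the following sketch computes Cronbach's alpha with the usual variance-based formula, k/(k-1) times one minus the ratio of summed item variances to the variance of the total score. The function name and the item scores are hypothetical.

```python
from statistics import variance


def cronbach_alpha(items):
    """items: one list of scores per question, with subjects in the same order in each list."""
    k = len(items)
    item_variance_sum = sum(variance(scores) for scores in items)
    totals = [sum(subject_scores) for subject_scores in zip(*items)]  # total score per subject
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


# Hypothetical responses: three items answered by five subjects
items = [
    [1, 2, 3, 4, 5],
    [2, 2, 3, 5, 5],
    [1, 3, 3, 4, 4],
]
print(round(cronbach_alpha(items), 3))
```

Items that move together inflate the variance of the total score relative to the sum of the item variances, which pushes alpha toward 1.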
In measuring HRQOL, the measurements should be taken over time. This allows examination of the evolution of the score. The pattern of the score may be a linear, quadratic, or a threshold model, or something more complex. The timing of measurements should be precise - that is, if measurements are to be taken every two weeks, gaps of 5 and 6 weeks will make it difficult to interpret the measurements for that individual. In chronic diseases, it's important that the study be able to reflect a durable effect on HRQOL. Thus, if the disease leads to mortality within one year, a short-term response (1 month? 3 months? 6 months?) may be acceptable. If the disease has a long-term survival (breast cancer, diabetes), there is a need to show a longer-term benefit (6 months? 12 months? 24 months?). This implies longer follow-up in the studies. Finally, there is a potential bias issue and measurement error problem if the agreed upon HRQOL assessment time points are not adhered to rigorously. Overly liberal time windows are to be avoided.
Missing values plague all studies. It is important to recognize that in HRQOL studies, death is not a missing value. It corresponds to a poor outcome. However, it may not correspond to the worst possible HRQOL since there may be some dire disease states that are worse than death (e.g., constant agonizing pain without the ability to communicate needs). In some scales, a score of 0 may be assumed for death. However, other scales do not have a clearly defined minimum value. In such cases a decision should be made prospectively about the score that will be assigned for death and for dropouts for other reasons. When the data are collected over time, a missing time point is more serious than a single missing item. Appropriate rules can be constructed for the latter. For the former (the missed visit), plans for several sensitivity analyses are appropriate. Imputation methods can be outlined in the study protocol. The most serious missing values arise from study dropouts. In this case, all observations from one time onward may be missing. If possible, the subjects should be encouraged to obtain evaluations on schedule to reduce the number of missing values as much as possible. In addition, efforts should be made to obtain evaluations for a period after dropout. Such information is an important component in structuring sensitivity analyses.
The outcome of the clinical trial is used to determine whether a product will be approved (or a new indication listed on the label of an existing product). Thus, it is crucial to pre-specify what the outcome will be. If a subscale is to be used, it must be specified before the study is unblinded, preferably well before that time. It is not acceptable to examine the data and then select an outcome variable. If several scales are co-primary endpoints, plans to adjust for false positive error inflation due to multiplicity (multiple endpoints) need to be specified in the protocol.
It is important to ensure that the trial sponsor and the FDA have agreed on the design of the study, sample size, outcome measures, and analysis. The FDA strongly encourages trial sponsors to discuss the trial with them, either by telephone or in person. The sponsor should submit plans for studies as an IND at least 30 days in advance of the meeting so that FDA staff can review and react to them. It is in everyone's interest to have trials that are designed and conducted to meet their stated objectives.
Statistical Analytic Issues:
Features that make HRQOL endpoints more challenging to analyze than other longitudinal measures are:
Complex correlation structure
Temporal patterns induced by repeated evaluations
Greater subjectivity of these measurements
Multidimensionality
Greater potential for informative dropouts
Thus, many statistical analytic issues need to be addressed. The majority of these are important in the regulatory setting. Such issues include
False positive error inflation
Appropriateness of univariate versus multivariate approaches
Impact of dropouts (particularly when informative)
Missing data imputation techniques
Longitudinal analysis and correlation structure
Robustness of modeling and imputation methods
When evaluating a HRQOL claim for product approval, the same level of rigor pertains as for other efficacy endpoints. Each of these issues is discussed below.
False Positive Error Inflation: This can occur due to multiplicity of endpoints and/or multiple statistical tests performed. HRQOL questionnaires usually contain many individual questions (in excess of 30 in many cases) and these are usually grouped into sets of related questions or domains (five or six domains are common). Thus, it is essential to pre-specify a small number of major HRQOL questions to avoid major multiplicity problems. Adjustment for multiplicity is required. As it is difficult or impossible to impose many of these adjustments post hoc, appropriate endpoint multiplicity adjustment should be pre-specified in the protocol. Another source of false positive error inflation is multiple comparisons of post-baseline values to baseline. These tests are not independent, and such a strategy inflates the false positive error rate.
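The text does not name a particular adjustment. As one illustration only, the sketch below applies the Holm step-down procedure, a common choice that controls the familywise error rate, to a small set of pre-specified domain tests. The function name and the p-values are hypothetical.

```python
def holm_reject(p_values, alpha=0.05):
    """Return a list of booleans, True where Holm's step-down procedure rejects."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices from smallest p upward
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The smallest p-value is compared with alpha/m, the next with alpha/(m-1), ...
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values also fail
    return reject


# Hypothetical p-values for three pre-specified HRQOL domains
print(holm_reject([0.010, 0.040, 0.030]))
```

Two of these three p-values fall below 0.05 without being rejected after adjustment, which is the kind of change in conclusions that makes post hoc adjustment contentious and pre-specification important.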
Univariate versus Multivariate Analytic Approaches: Univariate techniques are attractive since they are straightforward to implement and yield results that are not difficult to interpret. Such approaches include
Comparison of baseline score to a pre-specified post-baseline value
Endpoint analysis (comparison of baseline to the last recorded value)
Utility score Q-TwiST (Quality Adjusted Time without Symptoms or Toxicity)
Summary measures (patient slopes, Area Under the Curve)
There are limitations to such techniques. They do not enable one to adequately assess the missing data mechanism (informative or not). Endpoint analysis only yields unbiased results if the missing data follow a non-informative mechanism. Such methods do not provide an assessment of the HRQOL score's evolution over time. For example, time to event analyses (i.e., time to a pre-specified deterioration in HRQOL score) may not capture important temporal patterns (e.g., gradual versus abrupt decline). Summary measures should be used cautiously because statistical testing can be sensitive to the measure chosen. In addition, they should not stand alone because they can be subject to bias if follow-up duration differs by treatment group and if the missing data mechanism is informative. Hence, sensitivity analyses are important. Q-TwiST analysis assigns utility scores to different health states (e.g., toxicity, relapse of disease, or death) and employs a competing risks time to event analysis. It is important to have sufficient follow-up before undertaking these analyses because the outcomes may not be relevant to HRQOL in the disease context. With short follow-up time, many subjects will be censored. This leads to large variances of the time to event distributions. There is an element of subjectivity in choosing the utility weights. Sensitivity analysis is needed to demonstrate relative insensitivity to the choice of weights. Multivariate approaches, such as MANOVA, require complete data and somewhat restrictive assumptions (e.g., normality of data). MANOVA is known to be sensitive to non-normal data and outliers, and to have a low breakdown point (i.e., the proportion of outliers that affect the analysis is small).
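The summary measures mentioned above (patient slopes and area under the curve) reduce each patient's series of scores to a single number. A minimal sketch, assuming a least-squares slope and a trapezoidal area, with hypothetical function names, visit times, and scores:

```python
def auc_trapezoid(times, scores):
    """Area under the score-versus-time curve by the trapezoidal rule."""
    pairs = list(zip(times, scores))
    return sum((t1 - t0) * (s0 + s1) / 2 for (t0, s0), (t1, s1) in zip(pairs, pairs[1:]))


def ls_slope(times, scores):
    """Ordinary least-squares slope of score on time for one patient."""
    n = len(times)
    t_bar = sum(times) / n
    s_bar = sum(scores) / n
    sxy = sum((t - t_bar) * (s - s_bar) for t, s in zip(times, scores))
    sxx = sum((t - t_bar) ** 2 for t in times)
    return sxy / sxx


# Hypothetical patient assessed at weeks 0, 4, 8 and 12
weeks = [0, 4, 8, 12]
scores = [100, 104, 106, 112]
print(auc_trapezoid(weeks, scores), ls_slope(weeks, scores))

# The same patient observed only through week 8 has a much smaller area,
# which is why unequal follow-up between arms can bias an AUC comparison.
print(auc_trapezoid(weeks[:3], scores[:3]))
```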
Missing Data Imputation: Complete case analysis or missing data imputation methods, such as LOCF (last observation carried forward), are often used to obtain a 'complete' data set. Both of these strategies require assuming a non-informative missing data mechanism. Unfortunately, this assumption is not testable. Nonetheless, an in-depth investigation of the type of missing data mechanism at work, with appropriate sensitivity analyses, should be undertaken. The LOCF method assumes a patient's score remains constant. This assumption is untenable in most cases. It leads to underestimation of the variability of the data, and consequent inflation of the false positive error rate. Other simple imputation techniques include worst outcome and worst case approaches. In the worst outcome method, a missing observation in a treatment group is replaced by the worst outcome in that treatment group. The worst case method is most conservative since a missing value in the treatment group is replaced by the worst overall value and a missing value in the control group is replaced by the most favorable value. These methods are biased, but are useful for sensitivity analyses. Another more sophisticated, computer-intensive technique is multiple imputation (MI), which uses observations from patients with complete data to predict values for patients with missing data. One defines classes of data based on other variables (e.g., initial value and clinically relevant covariables). A set of 'close' observations with complete data (perhaps 10 or 20 cases) is identified. A value to impute is randomly selected, and this is done several times, generating a data set at each cycle (five is common). A parameter (e.g., a mean) is estimated for each data set. These estimates are combined and we compute the variability due to the imputation and due to the data. The key assumption is that the data are missing completely at random (MCAR) within the strata. Again, this is not a testable assumption.
Thus, sensitivity analyses are essential with this methodology. No matter what imputation technique is used, it must be emphasized that death is not a missing value; it needs to be scored appropriately. From a regulatory standpoint, informative missingness is a major problem because if hypothesis tests are marginally significant, missing values can change conclusions substantially. In structuring a statistical analysis plan for a pivotal study in a regulatory submission, sponsors should plan for sensitivity analyses for the methods employed. Such strategies explore methods to examine the degree to which the missing value methods affect analytic results (e.g., a change in directionality is a serious concern).
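The simple single-imputation rules described above (LOCF, worst outcome, and worst case) can be sketched as follows. The sketch assumes that higher scores are better, that missing values are coded as None, and that 'most favorable value' means the best value observed in either group. The function names and data are hypothetical.

```python
def locf(series):
    """Carry the last observed value forward over missing (None) entries."""
    filled, last = [], None
    for value in series:
        if value is not None:
            last = value
        filled.append(last)
    return filled


def worst_outcome(final_scores):
    """Replace missing final scores by the worst (lowest) observed score in the same group."""
    worst = min(v for v in final_scores if v is not None)
    return [worst if v is None else v for v in final_scores]


def worst_case(treated, control):
    """Treated: missing -> worst value observed overall; control: missing -> best value observed."""
    observed = [v for v in treated + control if v is not None]
    lowest, highest = min(observed), max(observed)
    return ([lowest if v is None else v for v in treated],
            [highest if v is None else v for v in control])


# Hypothetical data: one patient's visits, then final scores by arm
print(locf([100, 104, None, None]))
treated = [110, None, 95]
control = [100, 90, None]
print(worst_outcome(treated))
print(worst_case(treated, control))
```

Running the planned analysis under each rule and comparing the conclusions is one way to structure the sensitivity analyses discussed here. Deaths would still need a prospectively assigned score rather than an imputed one.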
Longitudinal approach: Perhaps with the exception of short-term studies, a longitudinal analysis generally should be an essential component of the statistical evaluation of the HRQOL measures. Longitudinal analyses are needed to characterize temporal patterns, investigate the effects of dropouts, study the influence of baseline covariates on time trends, and place univariate comparisons in context. One approach makes use of growth curve analysis in the context of a pattern mixture model (Little, 1993, 1995) to study missing data mechanisms. These mixed effects models are useful in this setting because the evaluations are often not performed at the specified times and invariably there are missing observations. The growth curve analysis fits polynomial growth curves to each HRQOL component using time as the predictor. Then one examines the mean HRQOL response in the subgroups who did or did not drop out within a time frame constituting minimum adequate treatment. A prospective definition of dropout classes is obtained from clinical input. Sensitivity analyses are needed to assess the robustness of the cut-point chosen. Thus, two strata or homogeneity classes for a modified pattern mixture model are obtained. If the time trend for dropouts is different from that for completers, that provides evidence that the missingness mechanism is informative. In such a case, the investigator cannot ignore the missingness mechanism; she cannot pool all of the data to estimate time trends. Completers and dropouts need to be modeled separately. If the two trends are not different, then this suggests the missingness mechanism is noninformative. In that case, we can use all of the data in estimation. Other longitudinal approaches often used include selection models and generalized estimating equations (GEE) (Zeger, Liang and Albert, 1988). Selection models posit a statistical model for the missing data mechanism. An example is the Schluchter (1992) two-stage mixed effects model.
Again, sensitivity analyses are crucial for these models since they are based on untestable assumptions. The GEE approach (also requiring the non-informative missingness assumption) provides a robust estimate of the variance-covariance structure of the data. This method can be used in sensitivity analyses for the growth curve/modified pattern mixture strategy previously described.
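The idea of comparing time trends between dropouts and completers can be illustrated crudely. The sketch below is not the mixed effects growth curve model described above: it simply fits a least-squares line to each patient and averages the slopes within each dropout pattern, as a descriptive first look. The function names and data are hypothetical, and a formal comparison would need a model-based test.

```python
from statistics import mean


def ls_slope(times, scores):
    """Ordinary least-squares slope of score on time for one patient."""
    t_bar = mean(times)
    s_bar = mean(scores)
    sxy = sum((t - t_bar) * (s - s_bar) for t, s in zip(times, scores))
    sxx = sum((t - t_bar) ** 2 for t in times)
    return sxy / sxx


def mean_slope_by_pattern(patients):
    """patients: list of (dropped_out, times, scores); returns the mean slope per pattern."""
    slopes = {"completer": [], "dropout": []}
    for dropped_out, times, scores in patients:
        slopes["dropout" if dropped_out else "completer"].append(ls_slope(times, scores))
    return {pattern: mean(values) for pattern, values in slopes.items()}


# Hypothetical patients: completers are seen through week 12, dropouts leave earlier
patients = [
    (False, [0, 4, 8, 12], [100, 102, 104, 106]),
    (False, [0, 4, 8, 12], [90, 93, 96, 99]),
    (True, [0, 4], [100, 96]),
    (True, [0, 4, 8], [95, 91, 87]),
]
print(mean_slope_by_pattern(patients))
```

In this invented example the dropouts decline while the completers improve, the kind of divergence that would suggest an informative missingness mechanism.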
Robustness Concerns: Stating major hypotheses and analytic approaches prospectively is essential in the regulatory setting. Robustness of results is also an important concern. Thus, well considered sensitivity analyses need to be undertaken to examine the sensitivity of analytic findings to the missing data imputation algorithm chosen and the modeling approach undertaken. Substantial changes in significance or directionality are the major concern.
A Sensitivity Study:
DeMetri, et al (1998) studied quality of life in a variety of cancers using the FACT-An scale and estimated the correlation between the change in HRQOL from baseline and the change in hemoglobin level following administration of erythropoietin. We consider only the change in HRQOL. They obtained baseline measurements of HRQOL from 2289 patients with various cancers. They note that about 35% did not have a final measurement due to death (223), progressive disease (136), loss to follow-up (129), and adverse events (35), among other reasons. There were 1484 patients with both baseline and follow-up. The authors report a p-value for the difference in HRQOL scores of 0.001. If the reason for lack of a final observation were unrelated to outcome, this would be appropriate. However, death, progressive disease and adverse events do not suggest that the missing final observation is independent of outcome. The authors report a baseline value of 113.2 with a standard deviation of 29.7. If we assume a value for the mean increase in HRQOL, $\bar{d}$, we can use the formula for the paired t-test to get a rough estimate of the correlation $r$ between baseline and final levels. That is, $t = \bar{d}/(s_d/\sqrt{1484})$ and $s_d = s\sqrt{2(1-r)}$. A two-sided p-value of 0.001 corresponds to $t \approx 3.29$, so we can solve for $r$ given $\bar{d}$. Let's assume $\bar{d} = 3$. This would imply that $r \approx 0.3$. Let us consider how we might perform a sensitivity analysis for the patients without a final observation.
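The back-calculation of the correlation can be checked numerically. The sketch below rearranges the two paired t-test relations given above; the function name is hypothetical, and the inputs are the values quoted in the text (t of about 3.29, an assumed mean increase of 3, a standard deviation of 29.7, and 1484 completers).

```python
from math import sqrt


def implied_correlation(t, mean_diff, sd, n):
    """Solve t = mean_diff / (s_d / sqrt(n)) and s_d = sd * sqrt(2 * (1 - r)) for r."""
    s_d = mean_diff * sqrt(n) / t          # SD of the paired differences implied by t
    return 1 - s_d ** 2 / (2 * sd ** 2)    # invert s_d^2 = 2 * sd^2 * (1 - r)


r = implied_correlation(t=3.29, mean_diff=3.0, sd=29.7, n=1484)
print(round(r, 2))
```

The result depends strongly on the assumed mean increase, so the estimate of about 0.3 should be read as a rough working value only.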
Let the proportion of patients who have complete observations be $p$. If the values for the remaining patients are also normally distributed with the same standard deviation and correlation, the overall mean and variance are
$\mu = p\,\mu_{complete} + (1-p)\,\mu_{incomplete}$
and
$\text{variance} = p\,\sigma^2_{complete} + (1-p)\,\sigma^2_{incomplete} + p(1-p)(\mu_{complete} - \mu_{incomplete})^2$
The table below gives values of the mean and standard deviation of the difference for various assumptions about $\mu_{incomplete}$, using $s = 29.7$:
Mean for incomplete cases | 113  | 110  | 93    | 83
Mean difference           | 1.95 | 0.9  | -5.05 | -8.55
SD of difference          | 29.7 | 29.8 | 31.7  | 33.6
t                         | 3.14 | 1.44 | -7.63 | -12.17
Thus, significance is maintained only if the dropouts show no difference from baseline. The highly significant outcome obtained by ignoring the dropouts is sensitive to the assumption of no difference between dropouts and completers. The paired t-test is affected by the mean difference being reduced and by the variance being increased.
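The sensitivity table can be reproduced with the mixture formulas for the mean and variance. The inputs below are assumptions inferred from the tabulated values rather than stated in the text: a completer proportion of 0.65, a mean change of 3 among completers, a baseline of 113, a standard deviation of 29.7 for the change in both groups, and a t statistic based on all 2289 patients. The function name is hypothetical.

```python
from math import sqrt


def mixture_t(mu_incomplete, baseline=113.0, p=0.65, change_complete=3.0, sd=29.7, n=2289):
    """Return (mean difference, SD of difference, t) for an assumed final mean among dropouts."""
    change_incomplete = mu_incomplete - baseline
    mean_diff = p * change_complete + (1 - p) * change_incomplete
    # Within-group variance plus the between-group component of the mixture
    var = p * sd ** 2 + (1 - p) * sd ** 2 + p * (1 - p) * (change_complete - change_incomplete) ** 2
    sd_diff = sqrt(var)
    t = mean_diff / (sd_diff / sqrt(n))
    return mean_diff, sd_diff, t


for mu in (113, 110, 93, 83):
    mean_diff, sd_diff, t = mixture_t(mu)
    print(mu, round(mean_diff, 2), round(sd_diff, 1), round(t, 2))
```

Varying the assumed dropout mean in this way shows how quickly the t statistic changes sign as the dropouts are assumed to do worse.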
Generally, analyses should account for all subjects in a study since dropouts and protocol violations usually are a non-random subset of the group being studied.
An alternative model:
Lachenbruch (2000) studied a two-part model that compared the proportion of patients who dropped out using a binomial test and the difference in means using a t-test. This can be extended to using a Wilcoxon test. In this example, we do not have the original data, so cannot use the Wilcoxon. This leads to a two degree of freedom $\chi^2$ test of the form $B^2 + Z^2$. In this example, suppose we wish to test the combined null hypothesis $H_0: p = 0.6$ and $\mu_d = 0$. Then the statistic is $X^2 = 5.396^2 + 3.29^2 = 39.94$, which has a p-value less than 0.001. In this case, we would note that the product seems to improve those who complete, but that we have a higher than hypothesized proportion of dropouts.
We can easily generalize this to multiple reasons for not completing the study; say we wished to consider death, medical reasons not including death, and other reasons. Each such category adds a degree of freedom to the overall $\chi^2$. Suppose the combined null hypothesis is $H_0: p_{death} = 0.1$, $p_{medical} = 0.19$, $p_{other} = 0.06$, $\mu_d = 0$. In this case, $X^2 = 0.152 + 1.334 + 1.497 + 10.824 = 13.81$, which has a p-value of 0.008. The $X^2$ due to the dropouts has a p-value of 0.394. Note that the null hypothesis values were made up. For a sample size of 2289, small changes in the null hypothesis values can lead to large changes in the $X^2$.
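The arithmetic of the combined tests can be checked with a few lines of code. The sketch below takes the component statistics quoted in the text as given (it does not recompute them from counts) and evaluates the chi-square upper tail with a closed-form series for integer degrees of freedom, so that only the standard library is needed. The helper name chi2_sf is hypothetical.

```python
from math import erfc, exp, pi, sqrt


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer degrees of freedom."""
    half = x / 2
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{k < df/2} (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return exp(-half) * total
    # Odd df: erfc(sqrt(x/2)) + exp(-x/2) * sum_{k=1..(df-1)/2} (x/2)^(k-1/2) / Gamma(k+1/2)
    total = erfc(sqrt(half))
    term = sqrt(half) / (sqrt(pi) / 2)  # first term: (x/2)^(1/2) / Gamma(3/2)
    for k in range(1, (df - 1) // 2 + 1):
        total += exp(-half) * term
        term *= half / (k + 0.5)
    return total


# Two-part test: binomial component B and mean component Z, as quoted in the text
x2_two_part = 5.396 ** 2 + 3.29 ** 2
print(round(x2_two_part, 2), chi2_sf(x2_two_part, 2) < 0.001)

# Generalized test: three dropout categories plus the mean component
components = [0.152, 1.334, 1.497, 10.824]
print(round(sum(components), 2), round(chi2_sf(sum(components), 4), 3))

# Contribution of the dropout categories alone (3 degrees of freedom)
print(round(chi2_sf(sum(components[:3]), 3), 3))
```

Separating the dropout components from the mean component in this way makes it clear which part of the combined hypothesis drives the overall result.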
Conclusions:
Recently, the FDA has proposed the term Patient Reported Outcomes (PRO) as an alternative to HRQOL because it is a descriptive term of the outcome being studied. Whether this term gains acceptance remains to be seen. We provide some bullet points that summarize our main concerns in this article. This ordering is not intended to imply any value judgement (nor has this scale been validated or found reliable).
There should be an a priori plan for analyzing the study, including explicit criteria for "success."
Studies should be randomized and double-blind to remove or reduce effects of treatment knowledge on the response.
The HRQOL scale should be validated in the population/study group to which it is being applied. Both reliability and validity measures will be needed.
Measurements should be timed to reflect the disease context.
Missing values must be proactively minimized, and plans should be in place to evaluate the sensitivity of the results to missing observations.
Address multiplicity issues.
Talk to the regulatory agency before committing to a clinical trial to ensure it will be acceptable for product approval. In particular, discuss plans for any claims that will be made for the product.
References:
Berkanovic, E., Hurwicz, M. L. (1990) "Rheumatoid Arthritis and Comorbidity" J. Rheumatology 17(7) 888-892
DeMetri, G. D., Kris, M., Wade, J., Degos, L. and Cella, D. for the Procrit Study Group (1998) J. Clin. Oncol. 16 (10) October 3412-3425
Hurwicz, M. L., Berkanovic, E. (1993) "The Stress Process in Rheumatoid Arthritis" J. Rheumatology 20(11) 1836-1844
Lachenbruch, P. A. (2000) "Comparisons of two-part models with competitors" Statistics in Medicine, in press.
Little, R. J. A. (1993) "Pattern Mixture Models for Multivariate Incomplete Data" J. Amer. Stat. Assoc. 88 125-134
Little, R. J. A. (1995) "Modeling the drop-out mechanism in repeated-measures studies" J. Amer. Stat. Assoc. 90 1112-1121
Schluchter, M. D. (1992) "Methods for the Analysis of Informatively Censored Longitudinal Data" Statistics in Medicine 11 1861-1870
Staquet, M. J., Hays, R. D., Fayers, P. M. (1998) Quality of Life Assessment in Clinical Trials: Methods and Practice Oxford: Oxford University Press
Zeger, S. L., Liang, K. Y., and Albert, P. S. (1988) "Models for Longitudinal Data: A Generalized Estimating Equations Approach" Biometrics 44 1049-1060