There is much controversy about the need for, definition of, and timing of external validation. To that end we have asked the authors of the two leading textbooks in the field, co-author Frank Harrell (Regression Modeling Strategies) and Ewout Steyerberg of Erasmus University, Rotterdam (Clinical Prediction Models: A Practical Approach to Development, Validation, and Updating), to comment, resulting in the following three paragraphs.
A prognostic model should be valid outside the specifics of the sample in which it was developed. Ideally, a model is shown to predict accurately across a wide range of settings (Justice et al, Ann Int Med 1999). Evidence of such external validity requires evaluation by different research groups and may take several years. Researchers frequently make the mistake of labeling data splitting from a single sequence of patients as external validation when in fact this is a particularly low-precision form of internal validation better done using resampling (see below). On the other hand, external validation carried out by splitting in time (temporal validation) or by place is better replaced by considering interactions in the full dataset. For example, if a model developed on Canadians is found to be poorly calibrated for Germans, it is far better to develop an international model with country as one of the predictors. This implies that a researcher with access to data is always better off analyzing and publishing a model developed on the full set. That leaves external validation using (1) newly collected data, not available at the time of development; and (2) other investigators, at other sites, having access to other data. Approach (2) has been promoted by Justice as the strongest form of external validation. This phase is only relevant once internal validity has been shown for the developed model. But again, if such data were available at analysis time, those data are too valuable not to use in model development.
Even in the small subset of studies comprising truly external validations, it is a common misconception that the validation statistics are precise. Many if not most external validations are unreliable due to instability in the estimate of predictive accuracy. This instability comes from two sources: the size of the validation sample, and the constitution of the validation sample. The former is easy to envision, while the latter is more subtle. In one example, Frank Harrell analyzed 17,000 ICU patients, one-third of whom died, splitting the dataset into two halves: a training sample and a validation sample. He found that the validation c-index (ROC area) changed substantially when the 17,000 patients were re-allocated at random into a new training and test sample and the entire process repeated. Thus it can take quite a large external sample to yield reliable estimates and to "beat" strong internal validation using resampling. We therefore feel there is great utility in using strong internal validation.
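The re-allocation experiment described above can be imitated on a small scale. The following is only a minimal sketch under stated assumptions: the data are simulated (not the ICU data), the "model" is a toy mean-difference score, and the names `c_index`, `fit_mean_difference`, and `split_sample_c` are hypothetical helpers written for this illustration. It repeats a random 50/50 split many times and records how much the validation c-index moves from split to split.

```python
import math
import random


def c_index(scores, outcomes):
    """Concordance probability (ROC area) for a binary outcome; score ties count 1/2."""
    pos = [s for s, y in zip(scores, outcomes) if y == 1]
    neg = [s for s, y in zip(scores, outcomes) if y == 0]
    if not pos or not neg:
        raise ValueError("both outcome classes are required")
    total = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                total += 1.0
            elif p == q:
                total += 0.5
    return total / (len(pos) * len(neg))


def fit_mean_difference(X, y):
    """Toy model (an assumption of this sketch): weight each predictor by the
    difference in its mean between the two outcome groups."""
    n1 = sum(y)
    n0 = len(y) - n1
    weights = []
    for j in range(len(X[0])):
        m1 = sum(row[j] for row, yi in zip(X, y) if yi == 1) / n1
        m0 = sum(row[j] for row, yi in zip(X, y) if yi == 0) / n0
        weights.append(m1 - m0)
    return weights


def split_sample_c(X, y, rng):
    """Randomly split in half, fit on one half, return the c-index on the other."""
    idx = list(range(len(y)))
    rng.shuffle(idx)
    half = len(idx) // 2
    train, valid = idx[:half], idx[half:]
    w = fit_mean_difference([X[i] for i in train], [y[i] for i in train])
    scores = [sum(wj * xj for wj, xj in zip(w, X[i])) for i in valid]
    return c_index(scores, [y[i] for i in valid])


# Simulated data: only the first of five predictors carries signal.
rng = random.Random(2024)
n, p = 300, 5
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [1 if rng.random() < 1 / (1 + math.exp(-X[i][0])) else 0 for i in range(n)]

# Repeat the whole split / fit / validate process with fresh random splits.
c_values = [split_sample_c(X, y, rng) for _ in range(50)]
print("validation c-index over 50 random splits: min %.3f, max %.3f"
      % (min(c_values), max(c_values)))
```

The spread between the smallest and largest validation c-index is the quantity of interest: any single split would have reported only one of these values, with no indication of how much it depends on which patients happened to land in the validation half.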
At the time of model development, researchers should focus on showing internal validity of the model they propose, i.e. validity of the model for the setting that they consider. Estimates of model performance computed on the development sample are usually optimistic. The optimism can efficiently be quantified by a resampling procedure called the bootstrap, and the optimism can be subtracted out to obtain an unbiased estimate of future performance of the model on the same types of patients. The bootstrap, which enjoys a strong reputation in data analysis, entails drawing patients from the development sample with replacement. It allows one to estimate the likely future performance of a predictive model without waiting for new data to perform an external validation study. It is important that the bootstrap model validation be done rigorously. This means that all analytical steps that use the outcome variable are repeated in each bootstrap sample. In this way, the proper price is paid for any statistical assessments to determine the final model, such as choosing variables and estimating regression coefficients. When the resampling allows models and coefficients to disagree with themselves over hundreds of resamples, the proper price is paid for "data dredging", so that clinical utility (useful predictive discrimination) is not claimed for what is in fact overfitting (fitting "noise"). The bootstrap makes optimal use of the available data: it uses all data to develop the model and all data to internally validate the model, detecting any overfitting. One can call properly penalized bootstrapping rigorous or strong internal validation.
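The optimism correction described above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the data are pure simulated noise, the "modeling strategy" is a deliberately overfitting toy (pick the single best of ten predictors by looking at the outcome), and `c_index`, `fit`, `predict`, and `optimism_corrected_c` are hypothetical names. The key point is that `fit`, the step that uses the outcome, is rerun inside every bootstrap sample. Optimism is estimated as the average of (performance of the bootstrap model on its own bootstrap sample) minus (performance of that same model on the original sample), and is then subtracted from the apparent performance.

```python
import random


def c_index(scores, outcomes):
    """Concordance probability (ROC area) for a binary outcome; score ties count 1/2."""
    pos = [s for s, y in zip(scores, outcomes) if y == 1]
    neg = [s for s, y in zip(scores, outcomes) if y == 0]
    if not pos or not neg:
        raise ValueError("both outcome classes are required")
    total = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                total += 1.0
            elif p == q:
                total += 0.5
    return total / (len(pos) * len(neg))


def fit(X, y):
    """Toy strategy that uses the outcome: select the predictor whose c-index is
    farthest from 0.5, together with its direction. Returns (column, sign)."""
    best_j, best_c = 0, 0.5
    for j in range(len(X[0])):
        c = c_index([row[j] for row in X], y)
        if abs(c - 0.5) > abs(best_c - 0.5):
            best_j, best_c = j, c
    return best_j, (1.0 if best_c >= 0.5 else -1.0)


def predict(model, X):
    j, sign = model
    return [sign * row[j] for row in X]


def optimism_corrected_c(X, y, n_boot=100, seed=0):
    """Return (apparent c, estimated optimism, optimism-corrected c)."""
    rng = random.Random(seed)
    n = len(y)
    apparent = c_index(predict(fit(X, y), X), y)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # draw patients with replacement
        Xb = [X[i] for i in idx]
        yb = [y[i] for i in idx]
        if len(set(yb)) < 2:
            continue  # skip a degenerate resample containing one class only
        model_b = fit(Xb, yb)  # repeat every outcome-dependent step
        c_boot = c_index(predict(model_b, Xb), yb)  # on the bootstrap sample
        c_orig = c_index(predict(model_b, X), y)    # on the original sample
        diffs.append(c_boot - c_orig)
    optimism = sum(diffs) / len(diffs)
    return apparent, optimism, apparent - optimism


# Pure noise: ten predictors unrelated to a balanced binary outcome.
data_rng = random.Random(1)
n, p = 80, 10
X = [[data_rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [i % 2 for i in range(n)]

apparent, optimism, corrected = optimism_corrected_c(X, y)
print("apparent c = %.3f, optimism = %.3f, corrected c = %.3f"
      % (apparent, optimism, corrected))
```

Because the predictors here are noise by construction, the true c-index of any such model is 0.5. The apparent c-index exceeds that only because the predictor was chosen by looking at the outcome, and the bootstrap estimate of optimism is what charges the price for that selection. If variable selection were performed once on the full data and only the coefficients re-estimated in each resample, the selection step would go unpenalized.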