16 Analysis of Observer Variability and Measurement Agreement
16.1 Intra- and Interobserver Disagreement
Before using a measurement instrument or diagnostic technique routinely, a researcher may wish to quantify the extent to which two determinations of the measurement, made by two different observers or measurement devices, disagree (interobserver variability). She may also wish to quantify the repeatability of one observer in making the measurement at different times (intraobserver variability). To make these assessments, she has each observer make the measurement for each of a number of experimental units (e.g., subjects).
The measurements being analyzed may be continuous, ordinal, or binary (yes/no). Ordinal measurements must be coded such that distances between values reflect the relative importance of disagreement. For example, if a measurement has the values 1, 2, 3 for poor, fair, good, it is assumed that “good” is as different from “fair” as “fair” is from “poor”. If this is not the case, a different coding should be used, such as coding 0 for “poor” if “poor” should be twice as far from “fair” as “fair” is from “good”. Measurements that are yes/no or positive/negative should be coded as 1 or 0. The reason for this will be seen below.
There are many statistical methods for quantifying inter- and intraobserver variability. Correlation coefficients are frequently reported, but a perfect correlation can result even when the measurements disagree by a factor of 10. Variance components analysis and intraclass correlation are often used, but these make many assumptions, do not handle missing data very well, and are difficult to interpret. Some analysts, in assessing interobserver agreement when each observer makes several determinations, compute differences between the average determinations for each observer. This method clearly yields a biased measurement of interobserver agreement because it cancels the intraobserver variability.
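The remark about correlation can be checked with a short sketch. The readings below are made-up illustrative values, not data from this chapter: the second observer reports exactly 10 times what the first reports, so the Pearson correlation is perfect even though the readings are far apart.

```r
# Hypothetical readings (illustrative values only)
x <- c(1, 2, 3, 4, 5)    # observer 1
y <- 10 * x              # observer 2 reads 10 times higher

r        <- cor(x, y)          # Pearson correlation
mean_abs <- mean(abs(x - y))   # mean absolute difference between observers

c(correlation = r, mean_abs_diff = mean_abs)
```

The correlation says nothing about the size of the discrepancy, whereas the mean absolute difference is expressed in the units of the measurement.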
A general and descriptive method for assessing observer variability will now be presented. The method uses a general type of statistic called the \(U\) statistic, invented by Hoeffding (1948)^{1}. For definiteness, an analysis for 3 observers and 2 readings per observer will be shown. When designing such a study, the researcher should remember that the number of experimental units is usually the critical factor in determining the precision of estimates. There is not much to gain from having each observer make more than a few readings or from having 30 observers in the study (although if few observers are used, these are assumed to be “typical” observers).
^{1} The Wilcoxon test and the \(c\)-index are other examples of \(U\) statistics.
The intraobserver disagreement for a single subject or unit is defined as the average of the intraobserver absolute measurement differences. In other words, intraobserver disagreement is the average absolute difference between any two measurements from the same observer. The interobserver disagreement for one unit is defined as the average absolute difference between any two readings from different observers. Disagreement measures are computed separately for each unit and combined over units (by taking the mean or median, for example) to get an overall summary measure. Units having more readings get more weight. When a reading is missing, that reading does not enter into any calculation, and the denominator used in finding each mean disagreement is reduced by the number of pairs that would have involved it.
Suppose that for one patient, observers A, B, and C make the following determinations on two separate occasions, all on the same patient:
| A | B | C |
|---|---|---|
| 5, 7 | 8, 5 | 6, 7 |
For that patient, the mean intraobserver difference is \((|5-7| + |8-5| + |6-7|)/3 = \frac{2+3+1}{3} = 2\).
The mean interobserver difference is \((|5-8| + |5-5| + |5-6| + |5-7| + |7-8| + |7-5| + |7-6| + |7-7| + |8-6| + |8-7| + |5-6| + |5-7|)/12 = (3+0+1+2+1+2+1+0+2+1+1+2)/12 = \frac{16}{12} = 1.33\).
If the first reading for observer A were unobtainable, the mean intraobserver difference for that patient would be \((|8-5| + |6-7|)/2 = \frac{3+1}{2} = 2\) and the mean interobserver difference would be \((|7-8| + |7-5| + |7-6| + |7-7| + |8-6| + |8-7| + |5-6| + |5-7|)/8 = (1+2+1+0+2+1+1+2)/8 = \frac{10}{8} = 1.25\).
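The arithmetic for this patient, with and without the missing reading, can be reproduced with a minimal sketch. The helper `pair_diffs()` is a hypothetical name introduced here for illustration. It forms every unordered pair of available readings and averages the absolute differences separately for same-observer and different-observer pairs.

```r
# Mean absolute intra- and interobserver differences for one unit.
# pair_diffs() is a hypothetical helper written for this sketch.
pair_diffs <- function(y, obs) {
  keep <- !is.na(y)                      # a missing reading enters no pair
  y    <- y[keep]
  obs  <- obs[keep]
  idx  <- combn(length(y), 2)            # all unordered pairs of readings
  dif  <- abs(y[idx[1, ]] - y[idx[2, ]])
  same <- obs[idx[1, ]] == obs[idx[2, ]] # TRUE for same-observer pairs
  c(nintra = sum(same),  intra = mean(dif[same]),
    ninter = sum(!same), inter = mean(dif[!same]))
}

obs      <- c('A', 'A', 'B', 'B', 'C', 'C')
complete <- pair_diffs(c(5, 7, 8, 5, 6, 7), obs)    # all six readings
missing1 <- pair_diffs(c(NA, 7, 8, 5, 6, 7), obs)   # first reading of A missing

complete
missing1
```

With all readings present there are 3 intraobserver and 12 interobserver pairs. Dropping one reading leaves 2 and 8 pairs, which are the denominators used in the hand calculation.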
The computations are carried out in like manner for each patient and summarized as follows:
| Patient | Intraobserver Difference | Interobserver Difference |
|---|---|---|
| 1 | 2.00 | 1.33 |
| 2 | 1.00 | 3.50 |
| 3 | 1.50 | 2.66 |
| . | . | . |
| . | . | . |
| \(n\) | . | . |
| Overall Average (or median) | 1.77 | 2.23 |
| \(Q_{1}\) | 0.30 | 0.38 |
| \(Q_{3}\) | 2.15 | 2.84 |
Here is an example using R to compute mean inter- and intraobserver absolute differences for 4 subjects, each assessed twice by each of 3 observers. The first subject's data are those shown above. The calculations are first done for the first subject alone, to check against the computations above.
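```r
d <- expand.grid(rep=1:2, observer=c('A','B','C'), subject=1:4)
d$y <- c(5,7, 8,5, 6,7,
         7,6, 8,6, 9,7,
         7,5, 4,6, 10,11,
         7,6, 5,6, 9,8)
d
```

```
   rep observer subject  y
1    1        A       1  5
2    2        A       1  7
3    1        B       1  8
4    2        B       1  5
5    1        C       1  6
6    2        C       1  7
7    1        A       2  7
8    2        A       2  6
9    1        B       2  8
10   2        B       2  6
11   1        C       2  9
12   2        C       2  7
13   1        A       3  7
14   2        A       3  5
15   1        B       3  4
16   2        B       3  6
17   1        C       3 10
18   2        C       3 11
19   1        A       4  7
20   2        A       4  6
21   1        B       4  5
22   2        B       4  6
23   1        C       4  9
24   2        C       4  8
```

```r
# Function to compute mean absolute discrepancies
mad <- function(y, obs, subj) {
  nintra <- ninter <- sumintra <- suminter <- 0
  n <- length(y)
  for(i in 1 : (n - 1)) {
    for(j in (i + 1) : n) {
      # Only pairs of readings on the same subject are compared
      if(subj[i] == subj[j]) {
        dif <- abs(y[i] - y[j])
        if(! is.na(dif)) {
          if(obs[i] == obs[j]) {
            nintra   <- nintra + 1
            sumintra <- sumintra + dif
          } else {
            ninter   <- ninter + 1
            suminter <- suminter + dif
          }
        }
      }
    }
  }
  c(nintra=nintra, intra=sumintra / nintra,
    ninter=ninter, inter=suminter / ninter)
}

# Compute statistics for first subject
with(subset(d, subject == 1), mad(y, observer, subject))
```

```
   nintra     intra    ninter     inter 
 3.000000  2.000000 12.000000  1.333333 
```

```r
# Compute for all subjects
with(d, mad(y, observer, subject))
```

```
   nintra     intra    ninter     inter 
12.000000  1.583333 48.000000  2.125000 
```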
Zhouwen Liu in the Vanderbilt Department of Biostatistics has developed much more general-purpose software for this in R. Its web pages are biostat.app.vumc.org/AnalysisOfObserverVariability and github.com/harrelfe/rscripts. The following example loads the source code and runs the above example. The R functions implement bootstrap nonparametric percentile confidence limits for mean absolute discrepancy measures.
```r
require(Hmisc)
getRs('observerVariability.r')
```

```
YOU ARE RUNNING A DEMO VERSION 3_2
```

```r
with(d, {
  intra <- intraVar(subject, observer, y)
  print(intra)
  summary(intra)
  set.seed(2)
  b <- bootStrap(intra, by = 'subject', times=1000)
  # Get 0.95 CL for mean absolute intraobserver difference
  print(quantile(b, c(0.025, 0.975)))
  inter <- interVar(subject, observer, y)
  print(inter)
  summary(inter)
  b <- bootStrap(inter, by = 'subject', times=1000)
  # Get 0.95 CL for mean absolute interobserver difference
  print(quantile(b, c(0.025, 0.975)))
})
# To load a demo file into an RStudio script editor window, type
# getRs('observerVariability_example.r', put='rstudio')
```
From the above output, the 0.95 CL for the mean absolute intraobserver difference is \([1.17, 1.92]\) and is \([1.33, 3.21]\) for the interobserver difference. The bootstrap confidence intervals use the cluster bootstrap to account for correlations of multiple readings from the same subject.
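The idea behind the cluster bootstrap can be sketched in base R. This is not the implementation in `observerVariability.r`. The helpers `one_subject()` and `pooled()` are hypothetical names, and the sketch assumes the pooled-over-pairs estimator used earlier. Whole subjects are resampled with replacement so that all readings from a subject stay together, and the limits it produces need not match those quoted above.

```r
# Data from the example above: 4 subjects, 3 observers, 2 readings each
d <- expand.grid(rep = 1:2, observer = c('A', 'B', 'C'), subject = 1:4)
d$y <- c(5,7, 8,5, 6,7,  7,6, 8,6, 9,7,  7,5, 4,6, 10,11,  7,6, 5,6, 9,8)

# Sums and counts of absolute differences within one subject
# (one_subject() is a hypothetical helper for this sketch)
one_subject <- function(y, obs) {
  idx  <- combn(length(y), 2)
  dif  <- abs(y[idx[1, ]] - y[idx[2, ]])
  same <- obs[idx[1, ]] == obs[idx[2, ]]
  ok   <- !is.na(dif)
  c(sumintra = sum(dif[ok & same]),  nintra = sum(ok & same),
    suminter = sum(dif[ok & !same]), ninter = sum(ok & !same))
}

# One row of sums and counts per subject
per_subj <- t(sapply(split(d, d$subject),
                     function(s) one_subject(s$y, s$observer)))

# Pooled mean absolute differences for a set of subject rows
pooled <- function(rows) {
  tot <- colSums(per_subj[rows, , drop = FALSE])
  c(intra = unname(tot['sumintra'] / tot['nintra']),
    inter = unname(tot['suminter'] / tot['ninter']))
}

set.seed(1)
B <- 1000
# Resample whole subjects with replacement, then recompute both means
boot <- t(replicate(B, pooled(sample(nrow(per_subj), replace = TRUE))))

est <- pooled(seq_len(nrow(per_subj)))               # point estimates
cl  <- apply(boot, 2, quantile, probs = c(0.025, 0.975))
est
cl
```

With only 4 subjects the percentile limits are crude. The sketch is meant to show the resampling unit, not to give usable intervals.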
When the measurement of interest is a yes/no determination, such as presence or absence of a disease, these difference statistics, when the absolute differences are summarized by averaging, generalize the fraction of units in which the yes/no determinations disagree (one minus the fraction in exact agreement). To see this, consider the following data with only one observer:
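| Patient | Determinations | \(D_{1}, D_{2}\) | Agreement? | \(\lvert D_{1} - D_{2} \rvert\) |
|---|---|---|---|---|
| 1 | Y Y | 1 1 | Y | 0 |
| 2 | Y N | 1 0 | N | 1 |
| 3 | N Y | 0 1 | N | 1 |
| 4 | N N | 0 0 | Y | 0 |
| 5 | N N | 0 0 | Y | 0 |
| 6 | Y N | 1 0 | N | 1 |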
The average \(|D_{1} - D_{2}|\) is \(\frac{3}{6} = 0.5\), which is equal to the proportion of cases in which the two readings disagree.
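The equivalence is easy to confirm in R. The sketch below codes the six pairs of determinations from the table as 1 (Y) and 0 (N) and compares the mean absolute difference with the proportion of patients whose two determinations differ.

```r
# Determinations from the table above, coded Y = 1, N = 0
D1 <- c(1, 1, 0, 0, 0, 1)
D2 <- c(1, 0, 1, 0, 0, 0)

mean_abs_diff <- mean(abs(D1 - D2))   # average |D1 - D2|
prop_disagree <- mean(D1 != D2)       # fraction of patients with disagreement

c(mean_abs_diff = mean_abs_diff, prop_disagree = prop_disagree)
```

This is why yes/no measurements are coded as 1 or 0: any other pair of codes would rescale the mean absolute difference away from a proportion.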
An advantage of this method of summarizing observer differences is that the investigator can judge what is an acceptable difference and relate this directly to the summary disagreement statistic.
16.2 Comparison of Measurements with a Standard
When the true measurement is known for each unit (or the true diagnosis is known for each patient), similar calculations can be used to quantify the extent of errors in the measurements. For each unit, the average (over observers) absolute difference from the true value is computed and these differences are summarized over the units. For example, if for unit #1 observer A measures 5 and 7, observer B measures 8 and 5, and the true value is 6, the average absolute error is \((|5-6| + |7-6| + |8-6| + |5-6|)/4 = \frac{1+1+2+1}{4} = \frac{5}{4} = 1.25\).
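A minimal sketch of the same calculation in R, using the readings and true value from the example. The element names `A1`, `A2`, `B1`, `B2` are labels chosen here for readability.

```r
# Readings for unit #1 by observers A and B, and the known true value
readings   <- c(A1 = 5, A2 = 7, B1 = 8, B2 = 5)
true_value <- 6

abs_err      <- abs(readings - true_value)   # absolute error of each reading
mean_abs_err <- mean(abs_err)                # average absolute error for the unit

abs_err
mean_abs_err
```

Repeating this for every unit and then taking the mean or median over units gives the overall summary, in the same way as for the disagreement measures.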
16.3 Special Case: Assessing Agreement In Two Binary Variables
16.3.1 Measuring Agreement Between Two Observers
Suppose that each of \(n\) patients undergoes two diagnostic tests that can yield only the values positive and negative. The data can be summarized in the following frequency table.
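| | Test 2: \(+\) | Test 2: \(-\) | Total |
|---|---|---|---|
| Test 1: \(+\) | \(a\) | \(b\) | \(g\) |
| Test 1: \(-\) | \(c\) | \(d\) | \(h\) |
| Total | \(e\) | \(f\) | \(n\) |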
An estimate of the probability that the two tests agree is \(p_{A}=\frac{a+d}{n}\). An approximate 0.95 confidence interval for the true probability is derived from \(p_{A} \pm 1.96 \sqrt{p_{A} (1 - p_{A})/n}\).^{2} If the disease being tested is very rare or very common, the two tests will agree with high probability by chance alone. The \(\kappa\) statistic is one way to measure agreement that is corrected for chance agreement. \[\kappa = \frac{p_{A} - p_{C}}{1 - p_{C}}\] where \(p_{C}\) is the expected agreement proportion if the two observers are completely independent. The statistic can be simplified to \[\kappa = \frac{2 (ad - bc)}{gf + eh}.\]
^{2} A more accurate confidence interval can be obtained using Wilson’s method as provided by the R Hmisc package binconf function.
If the two tests are in perfect agreement, \(\kappa=1\). If the two agree at the level expected by chance, \(\kappa=0\). If the level of agreement is less than one would obtain by chance alone, \(\kappa < 0\).
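The agreement proportion, its approximate confidence interval, and both forms of \(\kappa\) can be computed as in the sketch below. The counts are hypothetical and chosen only for illustration. The expression used for \(p_{C}\), \((ge + hf)/n^{2}\), is the chance agreement implied by independent margins and is spelled out here as an assumption about how the definition is applied to the table.

```r
# Hypothetical counts laid out as in the frequency table (illustrative only)
tab <- matrix(c(40, 10,
                 5, 45), nrow = 2, byrow = TRUE,
              dimnames = list(test1 = c('+', '-'), test2 = c('+', '-')))
cell_a <- tab[1, 1]; cell_b <- tab[1, 2]
cell_c <- tab[2, 1]; cell_d <- tab[2, 2]
g <- cell_a + cell_b; h <- cell_c + cell_d   # row totals
e <- cell_a + cell_c; f <- cell_b + cell_d   # column totals
n <- sum(tab)

p_A <- (cell_a + cell_d) / n                              # observed agreement
ci  <- p_A + c(-1, 1) * 1.96 * sqrt(p_A * (1 - p_A) / n)  # approximate 0.95 CI
p_C <- (g * e + h * f) / n^2                              # agreement expected under independence

kappa_def   <- (p_A - p_C) / (1 - p_C)                    # definition
kappa_short <- 2 * (cell_a * cell_d - cell_b * cell_c) / (g * f + e * h)  # simplified form

round(c(p_A = p_A, lower = ci[1], upper = ci[2],
        kappa_def = kappa_def, kappa_short = kappa_short), 3)
```

The two \(\kappa\) values agree, which is a quick numerical check on the simplified formula.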
A formal test of significance of the difference in the probabilities of \(+\) for the two tests is obtained using McNemar’s test. The null hypothesis is that the probability of \(+\) for test 1 is equal to the probability of \(+\) for test 2, or equivalently that the probability of observing \(+-\) is the same as that of observing \(-+\). The normal deviate test statistic is given by \[z = \frac{b - c}{\sqrt{b + c}}.\]
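As a sketch, the statistic can be computed directly from the two discordant counts and compared with base R's `mcnemar.test()`, whose chi-square statistic without continuity correction equals \(z^{2}\). The counts are hypothetical, and `n_pm` and `n_mp` are names chosen here for the \(+-\) and \(-+\) cells.

```r
# Hypothetical 2 x 2 counts: rows are test 1 (+, -), columns are test 2 (+, -)
tab <- matrix(c(40, 10,
                 5, 45), nrow = 2, byrow = TRUE)
n_pm <- tab[1, 2]   # test 1 positive, test 2 negative
n_mp <- tab[2, 1]   # test 1 negative, test 2 positive

z       <- (n_pm - n_mp) / sqrt(n_pm + n_mp)   # normal deviate
p_value <- 2 * pnorm(-abs(z))                  # two-sided P-value

# Cross-check against the built-in test without continuity correction
chisq <- unname(mcnemar.test(tab, correct = FALSE)$statistic)

c(z = z, p_value = p_value, z_squared = z^2, chisq = chisq)
```

Only the discordant cells enter the statistic, so the concordant counts have no influence on the test.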
16.3.2 Measuring Agreement Between One Observer and a Standard
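Suppose that each of \(n\) patients is studied with a diagnostic test and that the true diagnosis is determined, resulting in the following frequency table:

| | Diagnosis: \(+\) | Diagnosis: \(-\) | Total |
|---|---|---|---|
| Test: \(+\) | \(a\) | \(b\) | \(g\) |
| Test: \(-\) | \(c\) | \(d\) | \(h\) |
| Total | \(e\) | \(f\) | \(n\) |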
The following measures are frequently used to describe the agreement between the test and the true diagnosis. Here \(T^{+}\) denotes a positive test, \(D^{-}\) denotes no disease, etc.

| Quantity | Probability Being Estimated | Formula |
|---|---|---|
| Correct diagnosis probability | Prob\((T = D)\) | \(\frac{a+d}{n}\) |
| Sensitivity | Prob\((T^{+} \mid D^{+})\) | \(\frac{a}{e}\) |
| Specificity | Prob\((T^{-} \mid D^{-})\) | \(\frac{d}{f}\) |
| Accuracy of a positive test | Prob\((D^{+} \mid T^{+})\) | \(\frac{a}{g}\) |
| Accuracy of a negative test | Prob\((D^{-} \mid T^{-})\) | \(\frac{d}{h}\) |
The first and last two measures are usually preferred. Note that when the disease is very rare or very common, the correct diagnosis probability will be high by chance alone. Since the sensitivity and specificity are calculated conditional on the diagnosis, the prevalence of disease does not directly affect these measures. But sensitivity and specificity will vary with every patient characteristic related to the actual severity of disease, which these measures ignore.
When estimating any of these quantities, Wilson confidence intervals are useful adjunct statistics. A less accurate 0.95 confidence interval is obtained from \(p \pm 1.96\sqrt{\frac{p(1-p)}{m}}\) where \(p\) is the proportion and \(m\) is its denominator.
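The five measures and both kinds of interval can be sketched in base R as follows. The counts are hypothetical, and `wilson()` and `wald()` are helper names introduced here; `wilson()` codes the standard Wilson score formula directly rather than calling `binconf`, so that the sketch has no package dependency.

```r
# Hypothetical counts: rows are test (+, -), columns are diagnosis (+, -)
cell_a <- 45; cell_b <- 10
cell_c <- 5;  cell_d <- 90
g <- cell_a + cell_b; h <- cell_c + cell_d   # test totals
e <- cell_a + cell_c; f <- cell_b + cell_d   # diagnosis totals
n <- g + h

# Wilson score interval for x successes out of m (hypothetical helper)
wilson <- function(x, m, z = qnorm(0.975)) {
  p      <- x / m
  center <- (p + z^2 / (2 * m)) / (1 + z^2 / m)
  half   <- z * sqrt(p * (1 - p) / m + z^2 / (4 * m^2)) / (1 + z^2 / m)
  c(lower = center - half, upper = center + half)
}

# Less accurate interval p +/- 1.96 sqrt(p (1 - p) / m) (hypothetical helper)
wald <- function(x, m) {
  p <- x / m
  c(lower = p - 1.96 * sqrt(p * (1 - p) / m),
    upper = p + 1.96 * sqrt(p * (1 - p) / m))
}

# Numerator and denominator for each quantity in the table above
num <- c(correct = cell_a + cell_d, sensitivity = cell_a, specificity = cell_d,
         acc_pos = cell_a, acc_neg = cell_d)
den <- c(correct = n, sensitivity = e, specificity = f,
         acc_pos = g, acc_neg = h)

est        <- num / den
wilson_cl  <- t(mapply(wilson, num, den))
wald_cl    <- t(mapply(wald,   num, den))

round(cbind(estimate = est, wilson_cl), 3)
round(cbind(estimate = est, wald_cl), 3)
```

Note that each quantity has its own denominator \(m\) (for example \(e\) for sensitivity and \(g\) for the accuracy of a positive test), which is why the interval formula is written in terms of \(m\) instead of \(n\).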
16.4 Problems
1. Three technicians, using different machines, make 3 readings each. For the data that follow, calculate estimates of inter- and intratechnician discrepancy.

| Technician 1 readings | Technician 2 readings | Technician 3 readings |
|---|---|---|
| 18, 17, 14 | 16, 15, 16 | 12, 15, 12 |
| 20, 21, 20 | 14, 12 | 13 |
| 26, 20, 23 | 18, 20 | 22, 24 |
| 19, 17 | 16 | 21, 23 |
| 28, 24 | 32, 29 | 29, 25 |
2. Forty-one patients each receive two tests yielding the frequency table shown below. Calculate a measure of agreement (or disagreement) along with an associated 0.95 confidence interval. Also calculate a chance-corrected measure of agreement. Test the null hypothesis that the tests have the same probability of being positive and the same probability of being negative. In other words, test the hypothesis that the chance of observing \(+-\) is the same as that of observing \(-+\).

| | Test 2: \(+\) | Test 2: \(-\) |
|---|---|---|
| Test 1: \(+\) | 29 | 8 |
| Test 1: \(-\) | 0 | 4 |
16.5 References
Landis JR, Koch GG: A review of statistical methods in the analysis of data arising from observer reliability studies (Part II). Statistica Neerlandica 29:151–61, 1975.
Landis JR, Koch GG: An application of hierarchical \(\kappa\)-type statistics in the assessment of majority agreement among multiple observers. Biometrics 33:363–74, 1977.
Hoeffding, W. (1948). A class of statistics with asymptotically normal distributions. Ann Math Stat, 19, 293–325.
Partially reprinted in: Kotz, S., Johnson, N.L. (1992) Breakthroughs in Statistics, Vol I, pp 308–334. Springer-Verlag. ISBN 0387940375
This chapter was written by FE Harrell, 1987.