7  Data Checking

flowchart LR
Ran[Range Checks]
Con["Cross-Variable Consistency"]
Rep[Checking and Reporting<br>With Minimal Coding] --> Li[Listings] & S[Summaries]
Ran --> Rep
Con --> Rep

In addition to the descriptive statistics exemplified below, it is important to flag suspicious values in an automated way. Checking many columns can require running a large number of R expressions to classify observations as suspicious, so let’s automate the process somewhat by specifying a vector of expressions. We then have R “compute on the language”: the expressions are parsed both to find the observations to flag and to label the printed output. This is done by the dataChk function, available on GitHub.
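To make the idea of computing on the language concrete, here is a minimal base R sketch. It is not the dataChk source; the toy data frame `d` and the helper `flag` are hypothetical names made up for illustration. Each check is stored unevaluated, `eval` runs it with the data columns as variables, and `deparse` turns the same check back into text for use as a label.

```r
# Toy data (made up); column names mirror those used later in this section
d <- data.frame(id     = 1:6,
                age    = c(25, 40, 95, 60, 29, 71),
                baseEF = c(55, 74, 60, 80, 50, 65))

# A vector of unevaluated checks: each element is a call, not a logical value
checks <- expression(
  age < 30 | age > 90,
  baseEF > 77,
  baseEF > 99)

# Hypothetical helper: evaluate one check using the columns of data as
# variables, treating NA results as "not flagged"
flag <- function(chk, data) {
  z <- eval(chk, data)
  !is.na(z) & z
}

# deparse() converts each call back to text, giving a printable label
labels  <- sapply(checks, function(chk) paste(deparse(chk), collapse = ' '))
# IDs of the flagged observations, one list element per check
flagged <- lapply(checks, function(chk) d$id[flag(chk, d)])
names(flagged) <- labels
n <- lengths(flagged)

print(flagged[n > 0])                              # listings only for checks that fired
print(data.frame(Check = labels, n = unname(n)))   # counts for every check
```

Because the checks are stored as language objects rather than as text or as already-computed logical vectors, a single specification serves both the computation and the labeling.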

The following code produces separate output for each data check, each in its own Quarto tab. The dataset has no subject ID variable, so let’s create one, and also add a site variable to print. The arguments to dataChk are set so that no tab is produced for a condition that never occurs in the data (omit0=TRUE), and so that an additional tab shows all data flags, sorted by id and site (byid=TRUE).

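require(Hmisc)
require(data.table)
getRs('reptools.r')     # source reptools.r, which provides dataChk
getHdata(stressEcho)    # fetch the stressEcho dataset
w <- stressEcho
setDT(w)                # convert to a data.table by reference
w[, id := 1 : .N]       # sequential subject ID
set.seed(1)             # make the random site assignment reproducible
w[, site := sample(LETTERS[1:6], .N, replace=TRUE)]   # random site, A-F
# Each expression defines one condition that flags an observation as suspicious
checks <- expression(
  age < 30 | age > 90,
  gender == 'female' & maxhr > 170,
  baseEF %between% c(72, 77),    # inclusive at both ends
  baseEF > 77,
  baseEF > 99,                   # never occurs in these data
  sbp > 250 & maxhr < 160)
dataChk(w, checks, id=c('id', 'site'),
        omit0=TRUE, byid=TRUE, html=TRUE)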
     id site age 
 1:  14    B  29 
 2:  23    F  91 
 3:  30    D  26 
 4:  60    A  91 
 5:  64    D  92 
 6: 116    A  28 
 7: 235    F  91 
 8: 259    D  29 
 9: 313    C  93 
 
     id site gender maxhr 
 1:  11    A female   171 
 2:  89    B female   182 
 3: 412    E female   200 
 
     id site baseEF 
 1:  56    B     72 
 2: 200    A     75 
 3: 272    A     72 
 4: 366    E     74 
 5: 406    B     77 
 6: 433    A     74 
 7: 495    D     72 
 8: 496    E     74 
 
     id site baseEF 
 1: 299    D     79 
 2: 434    E     83 
 
     id site sbp maxhr 
 1:  51    B 309   146 
 2: 146    E 283   135 
 3: 353    D 274   117 
 
      id site                            Check     Values 
  1:  11    A gender == "female" & maxhr > 170 female 171 
  2:  14    B              age < 30 | age > 90         29 
  3:  23    F              age < 30 | age > 90         91 
  4:  30    D              age < 30 | age > 90         26 
  5:  51    B          sbp > 250 & maxhr < 160    309 146 
  6:  56    B                  baseEF [72, 77]         72 
  7:  60    A              age < 30 | age > 90         91 
  8:  64    D              age < 30 | age > 90         92 
  9:  89    B gender == "female" & maxhr > 170 female 182 
 10: 116    A              age < 30 | age > 90         28 
 11: 146    E          sbp > 250 & maxhr < 160    283 135 
 12: 200    A                  baseEF [72, 77]         75 
 13: 235    F              age < 30 | age > 90         91 
 14: 259    D              age < 30 | age > 90         29 
 15: 272    A                  baseEF [72, 77]         72 
 16: 299    D                      baseEF > 77         79 
 17: 313    C              age < 30 | age > 90         93 
 18: 353    D          sbp > 250 & maxhr < 160    274 117 
 19: 366    E                  baseEF [72, 77]         74 
 20: 406    B                  baseEF [72, 77]         77 
 21: 412    E gender == "female" & maxhr > 170 female 200 
 22: 433    A                  baseEF [72, 77]         74 
 23: 434    E                      baseEF > 77         83 
 24: 495    D                  baseEF [72, 77]         72 
 25: 496    E                  baseEF [72, 77]         74 
      id site                            Check     Values 
 
                              Check n 
 1              age < 30 | age > 90 9 
 2 gender == "female" & maxhr > 170 3 
 3                  baseEF [72, 77] 8 
 4                      baseEF > 77 2 
 5                      baseEF > 99 0 
 6          sbp > 250 & maxhr < 160 3
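The per-check listings above show only the identifying variables plus the variables each check refers to, and the combined table pastes those values into a single column. The following data.table sketch shows one way such output could be assembled. It is an illustration under stated assumptions, not how dataChk is actually implemented: the toy data `d` and the helper `listing` are hypothetical, and `all.vars` is used to find the variable names appearing in each check.

```r
library(data.table)

# Toy data (made up) with the same column names as the example above
d <- data.table(id     = 1:5,
                site   = c('A', 'B', 'A', 'C', 'B'),
                gender = c('female', 'male', 'female', 'female', 'male'),
                maxhr  = c(175, 180, 150, 190, 120),
                sbp    = c(120, 260, 130, 255, 270))

checks <- expression(
  gender == 'female' & maxhr > 170,
  sbp > 250 & maxhr < 160)

# Hypothetical helper: rows satisfying one check, keeping only the
# id variables and the variables that the check itself uses
listing <- function(chk, data, idvars = c('id', 'site')) {
  z    <- eval(chk, data)               # logical vector, one value per row
  rows <- which(z)                      # drops FALSE and NA
  keep <- c(idvars, all.vars(chk))      # all.vars: variable names in the call
  data[rows, keep, with = FALSE]
}

# Long table of all flags: one row per flagged observation per check
allflags <- rbindlist(lapply(checks, function(chk) {
  x <- listing(chk, d)
  if (nrow(x) == 0) return(NULL)        # skip checks that never fired
  lab <- paste(deparse(chk), collapse = ' ')
  x[, .(id, site, Check = lab, Values = do.call(paste, .SD)),
    .SDcols = all.vars(chk)]
}))
setorder(allflags, id, site)            # sort by the id variables

print(listing(checks[[1]], d))          # per-check listing
print(allflags)                         # combined listing
print(allflags[, .N, by = Check])       # frequency of each check that fired
```

Unlike the summary above, the final count in this sketch omits checks that flagged nothing, because such checks contribute no rows to `allflags`.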