7 Data Checking
Besides the useful descriptive statistics exemplified below, it is important to flag suspicious values in an automated way. Checking multiple columns can require running a large number of R expressions to classify observations as suspicious, so let’s automate the process somewhat by specifying a vector of expressions. We then have R “compute on the language” to parse the expressions, both for finding observations to flag and for printing. This is done by the dataChk function in qreport.
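To make “computing on the language” concrete, here is a minimal base-R sketch of the general idea, not the actual implementation of dataChk. The toy data frame d, its values, and the helper flagOne are hypothetical names introduced only for illustration. Each expression is kept unevaluated, then used three ways: evaluated against the data to find rows to flag, deparsed to get a printable label, and scanned for the variable names to print.

```r
# Toy data; the values are made up purely for illustration
d <- data.frame(id  = 1:5,
                age = c(25, 40, 95, 60, 31),
                sbp = c(120, 260, 130, 110, 300))

# A vector of unevaluated expressions, one per check
checks <- expression(age < 30 | age > 90,
                     sbp > 250)

# Hypothetical helper (not part of qreport): process a single check
flagOne <- function(e, data) {
  hit  <- eval(e, data)          # evaluate with the columns of data in scope
  hit  <- !is.na(hit) & hit      # treat NA results as not flagged
  vars <- all.vars(e)            # variable names appearing in the expression
  list(check = paste(deparse(e), collapse=' '),   # printable label
       rows  = data[hit, c('id', vars), drop=FALSE])
}

# Apply the helper to every check and print the flagged rows
res <- vector('list', length(checks))
for(i in seq_along(checks)) {
  res[[i]] <- flagOne(checks[[i]], d)
  cat('Check:', res[[i]]$check, '\n')
  print(res[[i]]$rows)
}
```

Because the expressions are stored rather than evaluated when written, the same object supplies both the logic and its human-readable description, so the two cannot drift apart.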
The following code produces separate output for each individual data check, in separate Quarto tabs. The dataset does not have a subject ID variable, so let’s create one, and also add a site variable to print. Arguments are specified to dataChk so that no tab is produced for a condition that never occurred in the data (omit0=TRUE), and a tab is produced showing all data flags, sorted by id and site (byid=TRUE).
require(Hmisc)
require(data.table)
require(qreport)
getHdata(stressEcho)
w <- stressEcho
setDT(w)
# Create a subject ID and a randomly assigned site variable
w[, id := 1 : .N]
set.seed(1)
w[, site := sample(LETTERS[1:6], .N, replace=TRUE)]
# One expression per data check
checks <- expression(
  age < 30 | age > 90,
  gender == 'female' & maxhr > 170,
  baseEF %between% c(72, 77),
  baseEF > 77,
  baseEF > 99,
  sbp > 250 & maxhr < 160)
dataChk(w, checks, id=c('id', 'site'),
        omit0=TRUE, byid=TRUE, html=TRUE)
    id site age
1:  14    B  29
2:  23    F  91
3:  30    D  26
4:  60    A  91
5:  64    D  92
6: 116    A  28
7: 235    F  91
8: 259    D  29
9: 313    C  93

    id site gender maxhr
1:  11    A female   171
2:  89    B female   182
3: 412    E female   200

    id site baseEF
1:  56    B     72
2: 200    A     75
3: 272    A     72
4: 366    E     74
5: 406    B     77
6: 433    A     74
7: 495    D     72
8: 496    E     74

    id site baseEF
1: 299    D     79
2: 434    E     83

    id site sbp maxhr
1:  51    B 309   146
2: 146    E 283   135
3: 353    D 274   117

     id site Check                            Values
 1:  11    A gender == "female" & maxhr > 170 female 171
 2:  14    B age < 30 | age > 90              29
 3:  23    F age < 30 | age > 90              91
 4:  30    D age < 30 | age > 90              26
 5:  51    B sbp > 250 & maxhr < 160          309 146
 6:  56    B baseEF [72, 77]                  72
 7:  60    A age < 30 | age > 90              91
 8:  64    D age < 30 | age > 90              92
 9:  89    B gender == "female" & maxhr > 170 female 182
10: 116    A age < 30 | age > 90              28
11: 146    E sbp > 250 & maxhr < 160          283 135
12: 200    A baseEF [72, 77]                  75
13: 235    F age < 30 | age > 90              91
14: 259    D age < 30 | age > 90              29
15: 272    A baseEF [72, 77]                  72
16: 299    D baseEF > 77                      79
17: 313    C age < 30 | age > 90              93
18: 353    D sbp > 250 & maxhr < 160          274 117
19: 366    E baseEF [72, 77]                  74
20: 406    B baseEF [72, 77]                  77
21: 412    E gender == "female" & maxhr > 170 female 200
22: 433    A baseEF [72, 77]                  74
23: 434    E baseEF > 77                      83
24: 495    D baseEF [72, 77]                  72
25: 496    E baseEF [72, 77]                  74
     id site Check                            Values

  Check                            n
1 age < 30 | age > 90              9
2 gender == "female" & maxhr > 170 3
3 baseEF [72, 77]                  8
4 baseEF > 77                      2
5 baseEF > 99                      0
6 sbp > 250 & maxhr < 160          3
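
The last two pieces of output, the combined listing of all flags sorted by ID and the count of flagged observations per check, can be mimicked by hand. The following is a rough base-R sketch under stated assumptions: the toy data frame d, its values, and the object names smry and allFlags are hypothetical, and this is not how dataChk itself is written. A check that never fires contributes a zero to the summary but no rows to the listing, which parallels what omit0=TRUE does for tabs.

```r
# Toy data; the values are made up purely for illustration
d <- data.frame(id  = 1:5,
                age = c(25, 40, 95, 60, 31),
                sbp = c(120, 260, 130, 110, 300))

# The third check never fires in these data
checks <- expression(age < 30 | age > 90,
                     sbp > 250,
                     sbp > 400)

n       <- length(checks)
labels  <- character(n)
counts  <- integer(n)
flagged <- vector('list', n)

for(i in seq_len(n)) {
  e   <- checks[[i]]
  hit <- eval(e, d)                       # evaluate the check within d
  hit <- !is.na(hit) & hit                # NA counts as not flagged
  labels[i]    <- paste(deparse(e), collapse=' ')
  counts[i]    <- sum(hit)
  flagged[[i]] <- data.frame(id    = d$id[hit],
                             Check = rep(labels[i], sum(hit)))
}

# Per-check counts, including checks with zero flags
smry <- data.frame(Check = labels, n = counts)

# All flags stacked, skipping checks that never fired, sorted by id
allFlags <- do.call(rbind, flagged[counts > 0])
allFlags <- allFlags[order(allFlags$id), ]

print(smry)
print(allFlags)
```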