flowchart LR
Ran[Range Checks]
Con["Cross-Variable Consistency"]
Rep[Checking and Reporting<br>With Minimal Coding] --> Li[Listings] & S[Summaries]
Ran --> Rep
Con --> Rep
7 Data Checking
Besides the useful descriptive statistics exemplified below, it is important to flag suspicious values in an automated way. Checking multiple columns can require running a large number of R expressions to classify observations as suspicious, so let’s automate the process somewhat by specifying a vector of expressions. We then have R “compute on the language” to parse the expressions, both for finding the observations to flag and for printing. This is done by the dataChk function in the qreport package.
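To make the idea of computing on the language concrete, here is a minimal base R sketch. It is not the dataChk implementation; the toy data frame d, its values, and the helper flag_rows are hypothetical. Each element of an expression vector is an unevaluated call, which can be evaluated against the data with eval, turned into a printable label with deparse, and inspected with all.vars to decide which columns to list.

```r
# Toy data: column names and values are made up for illustration
d <- data.frame(id  = 1:5,
                age = c(25, 40, 95, 60, NA),
                sbp = c(120, 260, 130, 255, 140))

# A vector of unevaluated checks
checks <- expression(age < 30 | age > 90, sbp > 250)

# Hypothetical helper: list the rows flagged by one check,
# showing the id column plus the variables the check refers to
flag_rows <- function(data, check, id = 'id') {
  hit <- eval(check, data)          # evaluate the call using the columns of data
  hit <- !is.na(hit) & hit          # treat a missing result as not flagged
  data[hit, c(id, all.vars(check)), drop = FALSE]
}

for (i in seq_along(checks)) {
  chk <- checks[[i]]
  cat('Check:', paste(deparse(chk), collapse = ' '), '\n')  # expression as a label
  print(flag_rows(d, chk))
}
```

Because the checks are stored as language objects rather than as character strings or already-computed logical vectors, the same object serves both to compute the flags and to label the output.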
The following code produces separate output for each individual data check, in separate Quarto tabs. The dataset does not have a subject ID variable, so let’s create one, and also add a site variable to print. Arguments are specified to dataChk so that no tab is produced for a condition that never occurred in the data (omit0=TRUE), and so that a tab is produced showing all data flags, sorted by id and site (byid=TRUE).
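require(Hmisc)
require(data.table)
require(qreport)
getHdata(stressEcho)     # fetch the stressEcho dataset
w <- stressEcho
setDT(w)                 # convert to a data.table by reference
w[, id := 1 : .N]        # create a subject ID
set.seed(1)              # make the random site assignment reproducible
w[, site := sample(LETTERS[1:6], .N, replace=TRUE)]
# Each expression defines one check; TRUE flags an observation as suspicious
checks <- expression(
  age < 30 | age > 90,
  gender == 'female' & maxhr > 170,
  baseEF %between% c(72, 77),
  baseEF > 77,
  baseEF > 99,
  sbp > 250 & maxhr < 160)
dataChk(w, checks, id=c('id', 'site'),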
        omit0=TRUE, byid=TRUE, html=TRUE)

age < 30 | age > 90
    id site age
1:  14    B  29
2:  23    F  91
3:  30    D  26
4:  60    A  91
5:  64    D  92
6: 116    A  28
7: 235    F  91
8: 259    D  29
9: 313    C  93

gender == "female" & maxhr > 170
    id site gender maxhr
1:  11    A female   171
2:  89    B female   182
3: 412    E female   200

baseEF [72, 77]
    id site baseEF
1:  56    B     72
2: 200    A     75
3: 272    A     72
4: 366    E     74
5: 406    B     77
6: 433    A     74
7: 495    D     72
8: 496    E     74

baseEF > 77
    id site baseEF
1: 299    D     79
2: 434    E     83

sbp > 250 & maxhr < 160
    id site sbp maxhr
1:  51    B 309   146
2: 146    E 283   135
3: 353    D 274   117

Key: id, site
     id site Check                            Values
 1:  11 A    gender == "female" & maxhr > 170 female 171
 2:  14 B    age < 30 | age > 90              29
 3:  23 F    age < 30 | age > 90              91
 4:  30 D    age < 30 | age > 90              26
 5:  51 B    sbp > 250 & maxhr < 160          309 146
 6:  56 B    baseEF [72, 77]                  72
 7:  60 A    age < 30 | age > 90              91
 8:  64 D    age < 30 | age > 90              92
 9:  89 B    gender == "female" & maxhr > 170 female 182
10: 116 A    age < 30 | age > 90              28
11: 146 E    sbp > 250 & maxhr < 160          283 135
12: 200 A    baseEF [72, 77]                  75
13: 235 F    age < 30 | age > 90              91
14: 259 D    age < 30 | age > 90              29
15: 272 A    baseEF [72, 77]                  72
16: 299 D    baseEF > 77                      79
17: 313 C    age < 30 | age > 90              93
18: 353 D    sbp > 250 & maxhr < 160          274 117
19: 366 E    baseEF [72, 77]                  74
20: 406 B    baseEF [72, 77]                  77
21: 412 E    gender == "female" & maxhr > 170 female 200
22: 433 A    baseEF [72, 77]                  74
23: 434 E    baseEF > 77                      83
24: 495 D    baseEF [72, 77]                  72
25: 496 E    baseEF [72, 77]                  74

  Check                            n
1 age < 30 | age > 90              9
2 gender == "female" & maxhr > 170 3
3 baseEF [72, 77]                  8
4 baseEF > 77                      2
5 baseEF > 99                      0
6 sbp > 250 & maxhr < 160          3
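
A per-check count table like the one above could be approximated by hand as follows. This is a sketch under stated assumptions, not how dataChk computes its summary: the data frame d and its values are made up, and dropping zero-count checks at the end only mimics the effect that omit0=TRUE has on the tabs.

```r
# Toy data: column names and values are made up for illustration
d <- data.frame(id  = 1:6,
                age = c(25, 40, 95, 60, 28, 50),
                sbp = c(120, 260, 130, 255, 140, NA))

checks <- expression(age < 30 | age > 90, sbp > 250, age > 120)

# Number of flagged observations per check; NA results are not counted
n <- vapply(seq_along(checks),
            function(i) sum(eval(checks[[i]], d), na.rm = TRUE),
            integer(1))

# Deparse each check to use it as a row label
labels <- vapply(seq_along(checks),
                 function(i) paste(deparse(checks[[i]]), collapse = ' '),
                 character(1))

summ <- data.frame(Check = labels, n = n)
print(summ)

# Keep only the checks that flagged at least one observation
nonzero <- summ[summ$n > 0, ]
print(nonzero)
```

Keeping the zero-count rows in the full summary is useful in its own right, since it documents that a check was run and found nothing.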