Problem Statement

A recent upgrade of R resulted in some numerical instability, which appears to be platform dependent, when doing ordinal regression in rms. Ordinal regression in this instance detects the unique floating point values of \(Y\) to assign the groups used in the regression.

Floating Point Comparisons

Group assignment requires comparing floating point numbers for equality. Testing floating point numbers for equality is a known issue with any digital system, especially for base 10 decimals encoded in binary. A common example is 0.1, whose binary encoding is a repeating fraction.

\[\frac{1}{10} = \frac{1}{2} \sum^\infty_{n=1} \frac{3}{16^n} = 0.00011001100110011 \dots_2 \]

0.5*sum(3/16^(1:1000))
## [1] 0.1

The formula returns the correct value.

(0.3*3)+0.1 == 1
## [1] FALSE

However, a simple equality comparison after a little arithmetic fails. Subtracting instead of comparing reveals the size of the discrepancy:

(0.3*3)+0.1 - 1
## [1] -1.110223e-16

This demonstrates the numerical error accumulated by translating base 10 decimals into binary and doing arithmetic. If floating point operations were exact, this difference would be zero.
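
The stored double behind the printed 0.1 can be made visible by requesting more digits than the default print shows (standard R formatting, nothing specific to this problem):

sprintf("%.20f", 0.1)
## [1] "0.10000000000000000555"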

3 * 3 + 1 - 10
## [1] 0

Note that when the same arithmetic is carried out above the decimal point, where every integer involved is exactly representable, the correct result of zero occurs.

See the well-known issues with IEEE 754 floating point for more details.

The classic solution to this problem is a tolerance comparison, abs(x - y) < tol.
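
A minimal sketch of such a comparison, written here as a hypothetical helper near(); base R also provides all.equal, which compares against a relative tolerance:

near <- function(x, y, tol = 1e-8) abs(x - y) < tol

near((0.3*3)+0.1, 1)
## [1] TRUE
isTRUE(all.equal((0.3*3)+0.1, 1))
## [1] TRUE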

Indeterminate Unique

The problem is exacerbated further when detecting the unique floating point numbers within a vector. An indeterminacy can arise with as few as three values, depending on the order of comparisons. For example, take the values 0.1, 0.2, and 0.3 and consider a tolerance of 0.15. Equality is non-transitive in this case:

grViz("graph ER {

a [label='0.1'];
b [label='0.2'];
c [label='0.3'];

a -- b [label='equal'];
b -- c [label='equal'];
a -- c [label='NOT EQUAL']; 
}")

One solution could be to use a clustering algorithm to assign groups; in the above case, if one chose 2 clusters, the value 0.2 is equally likely to be assigned to the 0.1 group or the 0.3 group. Thus the choice is not deterministic.

factor, match, and unique all suffer from both the floating point equality problem and the indeterminacy of cluster assignment when values are near the resolution, and the result can depend on the order in which the data is presented!
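
Two quick illustrations. First, the equality problem: these values print identically, yet unique() keeps both because their binary representations differ.

unique(c(0.3, 0.1 + 0.2))
## [1] 0.3 0.3

Second, order dependence: a toy sequential grouper (illustrative only, not the algorithm used by any of these functions) assigns each value to the first existing group within the tolerance.

group_seq <- function(x, tol = 0.15) {
  reps <- numeric(0)  # representative value of each group seen so far
  vapply(x, function(v) {
    hit <- which(abs(reps - v) < tol)[1]  # first group within tolerance
    if (is.na(hit)) {                     # no match: open a new group
      reps[length(reps) + 1] <<- v
      hit <- length(reps)
    }
    hit
  }, integer(1))
}

group_seq(c(0.1, 0.2, 0.3))  # 0.2 joins 0.1's group
## [1] 1 1 2
group_seq(c(0.2, 0.1, 0.3))  # starting from 0.2 absorbs both neighbors
## [1] 1 1 1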

Rounding Strategy

Rounding to a given precision after scaling is a numerically stable strategy. This truncates the numbers to within the required resolution or precision. Distance is now regularized to fixed points on an arbitrary precision number line, and the comparison is forced above the decimal point, preventing any repeating binary digits.

y <- (1:3)/10
precision <- log10(1/0.15)

y.scaled <- as.integer(round(y*10^precision))

y.scaled
## [1] 1 1 2
length(unique(y.scaled))
## [1] 2

Under this strategy, 0.1 and 0.2 are equal within 0.15 but 0.3 falls into a different group. Thus, by using a regularized number line, the indeterminacy of grouping is eliminated, along with the floating point representation issues in equality comparisons. The measurement is made against a consistent number line graduation that is independent of the values presented.

# Map each value to its nearest graduation on the 0.15 grid
plot(NA, NA, xlab="", ylab="Number Line", ylim=c(0, 0.6), axes=FALSE, xlim=c(0, 0.5))
axis(side=1, at=c(0, 0.15, 0.3, 0.45), labels=c("0 (0)", "0.15 (1)", "0.3 (2)", "0.45 (3)"))
points(c(0.1, 0.2, 0.3), rep(0.5, 3))
lines(c(0.1, 0.15), c(0.5, 0), lty=2)
lines(c(0.2, 0.15), c(0.5, 0), lty=2)
lines(c(0.3, 0.3), c(0.5, 0), lty=2)
text(0.1, 0.5, "0.1", pos=3)
text(0.2, 0.5, "0.2", pos=3)
text(0.3, 0.5, "0.3", pos=3)
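
Because the graduations are fixed, the grouping no longer depends on the order in which the data arrives; reusing the precision computed above:

length(unique(round(c(0.1, 0.2, 0.3)*10^precision)))
## [1] 2
length(unique(round(c(0.3, 0.2, 0.1)*10^precision)))
## [1] 2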

It is still theoretically possible to construct a point that falls exactly halfway between two graduations of the number line. This case is handled by the rounding rule applied at 0.5: R's round() follows the IEC 60559 standard and rounds halves to the even digit.
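
This round-half-to-even behavior is easy to check directly:

round(0.5)
## [1] 0
round(1.5)
## [1] 2
round(2.5)
## [1] 2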

The only remaining issue is choosing the precision, which is left as a parameter to the user under the proposed pull request; 7 digits is a reasonable default.
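
As a minimal sketch of the overall strategy (a hypothetical helper, not the actual code in the pull request), rounding to a user-chosen precision before grouping:

# Hypothetical unique_fp(), not the pull request implementation:
# round to `precision` digits so unique() compares on a fixed grid.
unique_fp <- function(y, precision = 7) unique(round(y, digits = precision))

unique_fp(c(0.3, 0.1 + 0.2))
## [1] 0.3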