Overflow behaviour of Double64 does not match Float64
Please re-open this issue:
The root cause seems to be in DoubleFloats:
```julia
julia> Double64(1e300)^2
NaN

julia> Float64(1e300)^2
Inf
```
Originally posted by @KlausC in https://github.com/JuliaMath/DoubleFloats.jl/issues/149#issuecomment-1133882519
A fix could look like this, in DoubleFloats/src/math/errorfree.jl:120:
"""
two_hilo_sum(a, b)
*unchecked* requirement `|a| ≥ |b|`
Computes `s = fl(a+b)` and `e = err(a+b)`.
"""
@inline function two_hilo_sum(a::T, b::T) where {T<:FloatWithFMA}
isfinite(b) || return a, a
s = a + b
e = b - (s - a)
return s, e
end
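For context, here is a minimal illustration (plain Float64, no package code) of the failure mode that guard targets: once the leading sum overflows, the error term becomes -Inf, and the resulting (Inf, -Inf) pair turns into NaN in later renormalization steps.

```julia
# Illustration only: the unguarded two_hilo_sum arithmetic on overflowing inputs.
a = floatmax(Float64)   # 1.7976931348623157e308
b = 1.0e308
s = a + b               # Inf -- the leading sum overflows
e = b - (s - a)         # 1.0e308 - Inf == -Inf
(s, e)                  # (Inf, -Inf); any later Inf + (-Inf) yields NaN
```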
This is a deep, heavily called function. I need to see if there is a way to resolve the issue that is less impactful.
Fortunately, this anomaly is not pervasive.
```julia
julia> exp(Double64(1e300))
Inf

julia> tanpi(Double64(1/2))
Inf
```
If the behavior is limited to one or just a few functions, it makes more sense to trap this case within ^ etc. I am checking other functions for the same behavior.
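For illustration only, trapping the case inside a power routine might look roughly like the sketch below; the function name and placement are assumptions of mine, not the package's code.

```julia
# Hypothetical sketch -- not DoubleFloats' implementation.
using DoubleFloats

function pow_guarded(x::Double64, n::Integer)
    fallback = HI(x)^n                  # Float64 overflow behaviour: +/-Inf (or NaN)
    isfinite(fallback) || return Double64(fallback)
    return x^n                          # otherwise take the normal double-double path
end
```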
This issue arises from a corner case in multiplication (or squaring) when the magnitudes become huge:
```julia
julia> a = Double64(sqrt(floatmax(Float64)))
1.3407807929942596e154

julia> a * a
1.7976931348623155e308

julia> a = Double64(1.000000000000001 * sqrt(floatmax(Float64)))
1.340780792994261e154

julia> a * a
NaN
```
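A plausible mechanism (my reading, illustrated with the standard FMA-based error-free product `p = a*b; e = fma(a, b, -p)`): when the rounded product overflows, the error term becomes -Inf, and the (Inf, -Inf) pair later collapses to NaN.

```julia
# Illustration with plain Float64: the error-free product on inputs whose product overflows.
hi = 1.000000000000001 * sqrt(floatmax(Float64))
p  = hi * hi           # Inf: the correctly rounded product overflows
e  = fma(hi, hi, -p)   # fma(hi, hi, -Inf) == -Inf
(p, e)                 # (Inf, -Inf) -- later sums of this pair produce NaN
```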
The next step is looking into this more deeply.
omitted
I think the issue is with all base operations (+, -, *, /); they all return NaN in cases where they should return Inf or -Inf.
```julia
julia> a = floatmax(Double64)
1.7976931348623157e308

julia> a + a
NaN

julia> -a - a
NaN

julia> a * a
NaN

julia> a / Double64(0.0)
Inf

julia> a / (1/a)
NaN

julia> 1/a
5.562684646268003e-309
```
That is clear, and helpful.
I found the problem -- when the result is Inf, the Double64 computation can produce HILO(result) == (Inf, -Inf) [or (-Inf, Inf)], which prints as NaN when it should print as HI(result). I need to see whether this is enough, or whether the result must be remade into the general form used in the package, (Inf, NaN) or (-Inf, NaN), to carry through subsequent calculations.
THIS IS INCORRECT, see next
The results need to be remade -- or something else modified
```julia
julia> exp(Double64(Inf))
Inf

julia> exp(Double64((Inf,-Inf)))   # !! incorrect initialization
NaN

julia> dump(ans)
Double64
  hi: Float64 NaN
  lo: Float64 NaN
```
Ignore last comment --
```julia
julia> b = Double64(Inf, -Inf)
Inf

julia> HILO(b)
(Inf, -Inf)

julia> exp(b)
Inf
```
Still though, there is the problem you highlighted .. and unfortunately it is not a string display issue:
```julia
julia> a = floatmax(Float64)
1.7976931348623157e308

julia> b = floatmax(Double64)
1.7976931348623157e308

julia> a + a
Inf

julia> b + b
NaN

julia> HILO(ans)
(NaN, NaN)
```
I think we must first make the base operation correct. The exp example should be considered later.
```julia
julia> d = Double64(Inf, -Inf)
Inf

julia> d + d
NaN

julia> d = Double64(Inf, Inf)
Inf

julia> d + d
NaN
```
It appears that I have to check for Inf after the initial arithmetic operation (the one establishing the HI part) and return immediately if it is +/-Inf (or NaN, although that will propagate anyway). This changes the routines in op_dddd_dd.jl, and probably those where one of the args is a DoubleFloat and the other a Float64 (although I have not checked yet).
I agree. The question is where to insert the checks to avoid a runtime regression as far as possible.
In my experience the intermediate terms are "contaminated" with NaN when operations like Inf - Inf are induced.
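A minimal illustration of that contamination, using the Knuth-style two_sum error term on plain Float64 (sketch only; the variable names are mine):

```julia
# Once the leading sum overflows, the error-term arithmetic hits Inf - Inf.
a  = floatmax(Float64)
b  = floatmax(Float64)
s  = a + b                  # Inf
a1 = s - b                  # Inf
b1 = s - a1                 # Inf - Inf == NaN
e  = (a - a1) + (b - b1)    # -Inf + NaN == NaN: the low word is contaminated
```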
the obvious place is to change (and similarly for sub_, mul_, dvi_)
```julia
@inline function add_dddd_dd(x::Tuple{T,T}, y::Tuple{T,T}) where T<:IEEEFloat
    xhi, xlo = x
    yhi, ylo = y
    hi, lo = two_sum(xhi, yhi)
    thi, tlo = two_sum(xlo, ylo)
    c = lo + thi
    hi, lo = two_hilo_sum(hi, c)
    c = tlo + lo
    hi, lo = two_hilo_sum(hi, c)
    return hi, lo
end
```
into
```julia
@inline function add_dddd_dd(x::Tuple{T,T}, y::Tuple{T,T}) where T<:IEEEFloat
    xhi, xlo = x
    yhi, ylo = y
    hi, lo = two_sum(xhi, yhi)
    isinf(hi) && return (hi, lo)   # inserting this
    thi, tlo = two_sum(xlo, ylo)
    c = lo + thi
    hi, lo = two_hilo_sum(hi, c)
    c = tlo + lo
    hi, lo = two_hilo_sum(hi, c)
    return hi, lo
end
```
It would be better if there were a less pervasive way, though.
Consequently all other *_dddd_dd should be modified accordingly.
But since add_dddd_dd is called many times internally, it may be possible to make the change only in the user-facing situations.
For example, add an add_dddd_dd_checked modified as above, then call that only in add_dbdb_db (a rough sketch follows below).
All other calls of *_dddd_dd must be verified as well, to find out whether they have to be replaced by the _checked versions.
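A hedged sketch of such a checked wrapper; the name add_dddd_dd_checked and the choice of returning (+/-Inf, NaN) (matching the package's nonfinite encoding discussed further down) are my assumptions, not the package's code:

```julia
# Sketch only: guard the user-facing entry, keep the fast unchecked kernel for internal callers.
@inline function add_dddd_dd_checked(x::Tuple{T,T}, y::Tuple{T,T}) where {T<:IEEEFloat}
    hi = first(x) + first(y)
    isfinite(hi) || return (hi, T(NaN))   # +/-Inf (or NaN) in the leading sum: return early
    return add_dddd_dd(x, y)
end
```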
Hmm, good thought.
e.g. replacing (the | should be || in any event)
```julia
@inline function add_dbdb_db(x::DoubleFloat{T}, y::DoubleFloat{T}) where {T<:IEEEFloat}
    (isnan(LO(x)) | isnan(LO(y))) && return add_dbdb_db_nonfinite(x,y)
    return DoubleFloat{T}(add_dddd_dd(HILO(x), HILO(y)))
end
```
with
```julia
@inline function add_dbdb_db(x::DoubleFloat{T}, y::DoubleFloat{T}) where {T<:IEEEFloat}
    !isfinite(HI(x) + HI(y)) && return add_dbdb_db_nonfinite(x,y)
    return DoubleFloat{T}(add_dddd_dd(HILO(x), HILO(y)))
end
```
or
```julia
@inline function add_dbdb_db(x::DoubleFloat{T}, y::DoubleFloat{T}) where {T<:IEEEFloat}
    isfinite(HI(x) + HI(y)) && return DoubleFloat{T}(add_dddd_dd(HILO(x), HILO(y)))
    add_dbdb_db_nonfinite(x,y)
end
```
That would work at the cost of one additional Float64 op.
(Actually I did not understand why the original test for infinity was based on the NaNs in the LO parts of the arguments.)
The encoding of (+/-Inf, NaN) for +/-Inf and (NaN, NaN) for NaN was chosen so that isnan(LO(x)) would select nonfinite values (it was the fastest way that occurred to me at the time). However, the following could be used on HI(x::Double64) too:
```julia
isnonfinite(x::Float64) = reinterpret(UInt64, x) & 0x7FFFFFFFFFFFFFFF >= 0x7FF0000000000000
isfinite(x::Float64)    = reinterpret(UInt64, x) & 0x7FFFFFFFFFFFFFFF <  0x7FF0000000000000
```
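A quick sanity check of that bit-pattern test (illustrative values only, assuming the isnonfinite definition above):

```julia
isnonfinite(Inf)                # true  (exponent bits all set, zero payload)
isnonfinite(-Inf)               # true  (the sign bit is masked off)
isnonfinite(NaN)                # true  (exponent bits all set, nonzero payload)
isnonfinite(1.0)                # false
isnonfinite(floatmax(Float64))  # false
```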
Your last proposal looks more obvious to me. (I mean isfinite(x::Double64) = isfinite(HI(x)).)
agreed
The same corrections need to be done to the DoubleT op FloatT (and FloatT op DoubleT) user-facing routines.
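For concreteness, a sketch of the analogous guard for one of the mixed DoubleFloat/Float routines; the function and helper names follow the naming pattern above but are assumptions of mine, not verified against the source:

```julia
# Sketch only: the same early return on a nonfinite leading sum, mixed-argument case.
@inline function add_dbfp_db(x::DoubleFloat{T}, y::T) where {T<:IEEEFloat}
    isfinite(HI(x) + y) && return DoubleFloat{T}(add_ddfp_dd(HILO(x), y))
    return add_dbfp_db_nonfinite(x, y)
end
```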
I will implement these changes on a new branch (with tests) and post here when done.
I saw that square, cube, and ^ are also involved.
ok .. let me know if you notice others
I understand that the ops in op_dddd_dd.jl, op_ddfp_dd.jl, op_fpdd_dd.jl, and op_dd_dd.jl, and many or all of those in op_fp_dd.jl and op_ddsi_dd.jl, need this adjustment.
Also returning NaN: cbrt (sqrt works), add2, sub2, mul2, div2 from arith.jl
> I understand that the ops in op_dddd_dd.jl, op_ddfp_dd.jl, op_fpdd_dd.jl, and op_dd_dd.jl, and many or all of those in op_fp_dd.jl and op_ddsi_dd.jl, need this adjustment.
I am not sure about that - as those functions are not exported (afaik). Modifying only the exported functions (and sufficient test cases for those) would not have as much impact on performance, I guess.
That would be nice. Those other functions have the same sort of trouble with the same sort of args:
```julia
julia> floatmax(Float64) + floatmax(Double64(1.0))
NaN

julia> floatmax(Double64(1.0)) * 2048
NaN
```