Overflow behaviour of Double64 does not match Float64
Please re-open this issue:
The root cause seems to be in DoubleFloats:
```julia
julia> Double64(1e300)^2
NaN

julia> Float64(1e300)^2
Inf
```
Originally posted by @KlausC in https://github.com/JuliaMath/DoubleFloats.jl/issues/149#issuecomment-1133882519
A fix could look like this, in DoubleFloats/src/math/errorfree.jl:120:
"""
two_hilo_sum(a, b)
*unchecked* requirement `|a| ≥ |b|`
Computes `s = fl(a+b)` and `e = err(a+b)`.
"""
@inline function two_hilo_sum(a::T, b::T) where {T<:FloatWithFMA}
isfinite(b) || return a, a
s = a + b
e = b - (s - a)
return s, e
end
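For context, here is a minimal illustration (plain Float64, no package code) of the failure mode that guard targets: once the leading sum overflows, the error term becomes -Inf, and the resulting (Inf, -Inf) pair turns into NaN in later renormalization steps.

```julia
# Illustration only: the unguarded two_hilo_sum arithmetic on overflowing inputs.
a = floatmax(Float64)   # 1.7976931348623157e308
b = 1.0e308
s = a + b               # Inf -- the leading sum overflows
e = b - (s - a)         # 1.0e308 - Inf == -Inf
(s, e)                  # (Inf, -Inf); any later Inf + (-Inf) yields NaN
```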
This is a deep, heavily called function. I need to see if there is a way to resolve the issue that is less impactful.
Fortunately, this anomaly is not pervasive.
```julia
julia> exp(Double64(1e300))
Inf

julia> tanpi(Double64(1/2))
Inf
```
If the behavior is limited to one or just a few functions, it makes more sense to trap this case within ^ etc. I am checking other functions for the same behavior.
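For illustration only, trapping the case inside a power routine might look roughly like the sketch below; the function name and placement are assumptions of mine, not the package's code.

```julia
# Hypothetical sketch -- not DoubleFloats' implementation.
using DoubleFloats

function pow_guarded(x::Double64, n::Integer)
    fallback = HI(x)^n                  # Float64 overflow behaviour: +/-Inf (or NaN)
    isfinite(fallback) || return Double64(fallback)
    return x^n                          # otherwise take the normal double-double path
end
```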
This issue arises from a corner case in multiplication (or squaring) when the magnitudes become huge:
```julia
julia> a = Double64(sqrt(floatmax(Float64)))
1.3407807929942596e154

julia> a * a
1.7976931348623155e308

julia> a = Double64(1.000000000000001 * sqrt(floatmax(Float64)))
1.340780792994261e154

julia> a * a
NaN
```
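A plausible mechanism (my reading, illustrated with the standard FMA-based error-free product `p = a*b; e = fma(a, b, -p)`): when the rounded product overflows, the error term becomes -Inf, and the (Inf, -Inf) pair later collapses to NaN.

```julia
# Illustration with plain Float64: the error-free product on inputs whose product overflows.
hi = 1.000000000000001 * sqrt(floatmax(Float64))
p  = hi * hi           # Inf: the correctly rounded product overflows
e  = fma(hi, hi, -p)   # fma(hi, hi, -Inf) == -Inf
(p, e)                 # (Inf, -Inf) -- later sums of this pair produce NaN
```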
The next step is looking into this more deeply.
omitted
I think the issue is with all base operations (+, -, *, /); they all return NaN in cases where they should return Inf or -Inf.
```julia
julia> a = floatmax(Double64)
1.7976931348623157e308

julia> a + a
NaN

julia> -a - a
NaN

julia> a * a
NaN

julia> a / Double64(0.0)
Inf

julia> a / (1/a)
NaN

julia> 1/a
5.562684646268003e-309
```
That is clear, and helpful.
I found the problem -- when the result is Inf, the Double64 computation can produce HILO(result) == (Inf, -Inf) [or (-Inf, Inf)], which prints as NaN when it should print as HI(result). I need to see whether this is enough, or whether the result must be remade into the general form used in the package, (Inf, NaN) or (-Inf, NaN), to carry through subsequent calculations.
THIS IS INCORRECT, see next
The results need to be remade -- or something else modified
```julia
julia> exp(Double64(Inf))
Inf

julia> exp(Double64((Inf,-Inf)))   # !! incorrect initialization
NaN

julia> dump(ans)
Double64
  hi: Float64 NaN
  lo: Float64 NaN
```
Ignore last comment --
```julia
julia> b = Double64(Inf, -Inf)
Inf

julia> HILO(b)
(Inf, -Inf)

julia> exp(b)
Inf
```
Still though, there is the problem you highlighted .. and unfortunately it is not a string display issue:
```julia
julia> a = floatmax(Float64)
1.7976931348623157e308

julia> b = floatmax(Double64)
1.7976931348623157e308

julia> a + a
Inf

julia> b + b
NaN

julia> HILO(ans)
(NaN, NaN)
```
I think we must first make the base operation correct. The exp example should be considered later.
```julia
julia> d = Double64(Inf, -Inf)
Inf

julia> d + d
NaN

julia> d = Double64(Inf, Inf)
Inf

julia> d + d
NaN
```
It appears that I have to check for Inf after the initial arithmetic operation (the one establishing the HI part) and return immediately if it is +/-Inf (or NaN, although that will propagate anyway). This changes the routines in op_dddd_dd.jl, and probably those where one of the args is a DoubleFloat and the other a Float64 (although I have not checked yet).
I agree. The question is where to insert the checks to avoid a runtime regression as far as possible.
In my experience the intermediate terms are "contaminated" with NaN when operations like Inf - Inf are induced.
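A minimal illustration of that contamination, using the Knuth-style two_sum error term on plain Float64 (sketch only; the variable names are mine):

```julia
# Once the leading sum overflows, the error-term arithmetic hits Inf - Inf.
a  = floatmax(Float64)
b  = floatmax(Float64)
s  = a + b                  # Inf
a1 = s - b                  # Inf
b1 = s - a1                 # Inf - Inf == NaN
e  = (a - a1) + (b - b1)    # -Inf + NaN == NaN: the low word is contaminated
```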
the obvious place is to change (and similarly for sub_, mul_, dvi_)
```julia
@inline function add_dddd_dd(x::Tuple{T,T}, y::Tuple{T,T}) where T<:IEEEFloat
    xhi, xlo = x
    yhi, ylo = y
    hi, lo = two_sum(xhi, yhi)
    thi, tlo = two_sum(xlo, ylo)
    c = lo + thi
    hi, lo = two_hilo_sum(hi, c)
    c = tlo + lo
    hi, lo = two_hilo_sum(hi, c)
    return hi, lo
end
```
into
```julia
@inline function add_dddd_dd(x::Tuple{T,T}, y::Tuple{T,T}) where T<:IEEEFloat
    xhi, xlo = x
    yhi, ylo = y
    hi, lo = two_sum(xhi, yhi)
    isinf(hi) && return (hi, lo)   # inserting this
    thi, tlo = two_sum(xlo, ylo)
    c = lo + thi
    hi, lo = two_hilo_sum(hi, c)
    c = tlo + lo
    hi, lo = two_hilo_sum(hi, c)
    return hi, lo
end
```
It would be better if there were a less pervasive way, though.
Consequently all other *_dddd_dd should be modified accordingly.
But since add_dddd_dd is called many times internally, it may be possible to make the change only in the user-facing situations.
For example, add an add_dddd_dd_checked modified as above, then call that only in add_dbdb_db (a rough sketch follows below).
All other calls of *_dddd_dd must be verified as well, to find out whether they have to be replaced by the _checked versions.
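A hedged sketch of such a checked wrapper; the name add_dddd_dd_checked and the choice of returning (+/-Inf, NaN) (matching the package's nonfinite encoding discussed further down) are my assumptions, not the package's code:

```julia
# Sketch only: guard the user-facing entry, keep the fast unchecked kernel for internal callers.
@inline function add_dddd_dd_checked(x::Tuple{T,T}, y::Tuple{T,T}) where {T<:IEEEFloat}
    hi = first(x) + first(y)
    isfinite(hi) || return (hi, T(NaN))   # +/-Inf (or NaN) in the leading sum: return early
    return add_dddd_dd(x, y)
end
```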
Hmm, good thought.
e.g. replacing (the | should be || in any event)
```julia
@inline function add_dbdb_db(x::DoubleFloat{T}, y::DoubleFloat{T}) where {T<:IEEEFloat}
    (isnan(LO(x)) | isnan(LO(y))) && return add_dbdb_db_nonfinite(x,y)
    return DoubleFloat{T}(add_dddd_dd(HILO(x), HILO(y)))
end
```
with
```julia
@inline function add_dbdb_db(x::DoubleFloat{T}, y::DoubleFloat{T}) where {T<:IEEEFloat}
    !isfinite(HI(x) + HI(y)) && return add_dbdb_db_nonfinite(x,y)
    return DoubleFloat{T}(add_dddd_dd(HILO(x), HILO(y)))
end
```
or
```julia
@inline function add_dbdb_db(x::DoubleFloat{T}, y::DoubleFloat{T}) where {T<:IEEEFloat}
    isfinite(HI(x) + HI(y)) && return DoubleFloat{T}(add_dddd_dd(HILO(x), HILO(y)))
    add_dbdb_db_nonfinite(x,y)
end
```
That would work at the cost of one additional Float64 op.
(Actually I did not understand why the original test for infinity was based on the NaNs in the LO parts of the arguments.)
The encoding of (+/-Inf, NaN) for +/-Inf and (NaN, NaN) for NaN was chosen so that isnan(LO(x)) would select nonfinite values (it was the fastest way that occurred to me at the time). However, the following could be used on HI(x::Double64) too:
```julia
isnonfinite(x::Float64) = reinterpret(UInt64, x) & 0x7FFFFFFFFFFFFFFF >= 0x7FF0000000000000
isfinite(x::Float64)    = reinterpret(UInt64, x) & 0x7FFFFFFFFFFFFFFF <  0x7FF0000000000000
```
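A quick sanity check of that bit-pattern test (illustrative values only, assuming the isnonfinite definition above):

```julia
isnonfinite(Inf)                # true  (exponent bits all set, zero payload)
isnonfinite(-Inf)               # true  (the sign bit is masked off)
isnonfinite(NaN)                # true  (exponent bits all set, nonzero payload)
isnonfinite(1.0)                # false
isnonfinite(floatmax(Float64))  # false
```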
Your last proposal looks more obvious to me. (I mean isfinite(x::Double64) = isfinite(HI(x)).)
agreed
The same corrections need to be done to the DoubleT op FloatT (and FloatT op DoubleT) user-facing routines.
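For concreteness, a sketch of the analogous guard for one of the mixed DoubleFloat/Float routines; the function and helper names follow the naming pattern above but are assumptions of mine, not verified against the source:

```julia
# Sketch only: the same early return on a nonfinite leading sum, mixed-argument case.
@inline function add_dbfp_db(x::DoubleFloat{T}, y::T) where {T<:IEEEFloat}
    isfinite(HI(x) + y) && return DoubleFloat{T}(add_ddfp_dd(HILO(x), y))
    return add_dbfp_db_nonfinite(x, y)
end
```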
I will implement these changes on a new branch (with tests) and post here when done.
I saw that square, cube, and ^ are also involved.
ok .. let me know if you notice others
I understand that the ops in op_dddd_dd.jl, op_ddfp_dd.jl, op_fpdd_dd.jl, and op_dd_dd.jl, and many or all of those in op_fp_dd.jl and op_ddsi_dd.jl, need this adjustment.
Also returning NaN: cbrt (sqrt works), add2, sub2, mul2, div2 from arith.jl
> I understand that the ops in op_dddd_dd.jl, op_ddfp_dd.jl, op_fpdd_dd.jl, and op_dd_dd.jl, and many or all of those in op_fp_dd.jl and op_ddsi_dd.jl, need this adjustment.
I am not sure about that - as those functions are not exported (afaik). Modifying only the exported functions (and sufficient test cases for those) would not have as much impact on performance, I guess.
That would be nice. Those other functions have the same sort of trouble with the same sort of args:
```julia
julia> floatmax(Float64) + floatmax(Double64(1.0))
NaN

julia> floatmax(Double64(1.0)) * 2048
NaN
```