PooledArrays.jl icon indicating copy to clipboard operation
PooledArrays.jl copied to clipboard

`pure=true` default to align with Base

Open ParadaCarleton opened this issue 4 years ago • 12 comments

I see why the pure keyword was added, but having pure=false contradicts Julia's behavior, which assumes mapped functions are pure. For instance:

julia> using SparseArrays

julia> A = sparse([1, 1, 2, 3], [1, 3, 2, 3], [3, 1, 2, 3])
3×3 SparseMatrixCSC{Int64, Int64} with 4 stored entries:
 3  ⋅  1
 ⋅  2  ⋅
 ⋅  ⋅  3

julia> map(x -> x^2 + rand(), A)
3×3 SparseMatrixCSC{Float64, Int64} with 9 stored entries:
 9.54687  0.85208  1.86548
 0.85208  4.31839  0.85208
 0.85208  0.85208  9.26578

This default behavior is also inconsistent with most programmers' expectations, as map is a functional construct and therefore tends to assume side-effect free functions. This could lead to bugs; it also means that any code using map on a vector of unknown type can't take advantage of the performance enhancements provided by PooledArrays, since pure is not a keyword for the map method in base. As a result, I propose defaulting to pure=true.

ParadaCarleton avatar Dec 05 '21 18:12 ParadaCarleton

This was not something we added because of an abstract thinking about a problem. It was the opposite - we changed it because users reported errors.

In particular your reasoning, while plausible is not consistent with map contract in Julia Base, which is:

Transform collection c by applying f to each element.

which clearly states that f must be applied to each element of c. If Julia Base changes this contract we can consider changing it here. Given this contract, I would say that your example from SparseArrays is a bug not a feature.

Additionally, I would expect most users of PooledArrays.jl to not be programmers but regular data scientists who are typically not aware of such subtleties unfortunately so we preferred to be on a safe side with the design.

bkamins avatar Dec 05 '21 18:12 bkamins

It's worth noting that in addition to sparse arrays, FillArrays has been using the efficient/pure implementation for a long time now, and I'm definitely wary of a diverging standard from other, very commonly-used packages.

ParadaCarleton avatar Feb 17 '22 16:02 ParadaCarleton

We could change this since in DataFrames.jl we are now safeguarded against this problem with other means.

@nalimilan - what do you think?

bkamins avatar Feb 17 '22 16:02 bkamins

I've commented at https://github.com/JuliaSparse/SparseArrays.jl/issues/4#issuecomment-1043459644.

TBH I doubt the pure=false default could create any bugs, as doing map(x -> x^2 + rand(), A) and assuming purity is a really weird thing to do -- especially given that this isn't clearly documented. But I agree we should use the same assumptions in all packages once we can clearly document the expectations in Base.

nalimilan avatar Feb 17 '22 21:02 nalimilan

OK - let us wait for core devs and then decide.

bkamins avatar Feb 17 '22 21:02 bkamins

I wonder whether the recent addition of various kinds of function purity (https://github.com/JuliaLang/julia/pull/43852) could be used here, so that by default we would treat the function as not pure (as currently), but for functions declared as pure (technically, I guess :consistent) we would only run them once for each value in the pool. Or maybe that's too technically advanced so that users won't take advantage of it.

nalimilan avatar Feb 24 '22 13:02 nalimilan

I think it makes sense, as I hope e.g. functions in Base and in most important packages can be annotated this way.

bkamins avatar Feb 24 '22 15:02 bkamins

I've commented at JuliaSparse/SparseArrays.jl#4 (comment).

TBH I doubt the pure=false default could create any bugs, as doing map(x -> x^2 + rand(), A) and assuming purity is a really weird thing to do -- especially given that this isn't clearly documented. But I agree we should use the same assumptions in all packages once we can clearly document the expectations in Base.

I don't think it can cause any true bugs, but I first commented when I noticed a massive performance regression in a package of mine that made it essentially unusable with big PooledArrays.

I think it makes sense, as I hope e.g. functions in Base and in most important packages can be annotated this way.

That would be good.

ParadaCarleton avatar Feb 25 '22 02:02 ParadaCarleton

I don't think it can cause any true bugs

We added this functionality because of the bugs reported by users.

However, I know that in many cases it causes a significant performance drop, so we need a good solution here.

bkamins avatar Feb 25 '22 06:02 bkamins

I don't think it can cause any true bugs, but I first commented when I noticed a massive performance regression in a package of mine that made it essentially unusable with big PooledArrays.

Do you think using the purity declarations added by https://github.com/JuliaLang/julia/pull/43852) would be an option in that use case?

nalimilan avatar Mar 01 '22 13:03 nalimilan

I don't think it can cause any true bugs

We added this functionality because of the bugs reported by users.

However, I know that in many cases it causes a significant performance drop, so we need a good solution here.

It looks like it is, indeed, intended behavior for map to assume purity. In that case, the current behavior should be considered a bug, since someone looking for map(x->rand()) might want to make sure map is producing identical outputs for identical inputs -- e.g. if mapping each participant's name to a randomly-generated ID.

ParadaCarleton avatar Apr 03 '22 01:04 ParadaCarleton

When https://github.com/JuliaLang/julia/pull/44233 in Base is decided can you please ping us here. Thank you!

bkamins avatar Apr 03 '22 06:04 bkamins