PooledArrays.jl `pure=true` default to align with Base

I see why the pure keyword was added, but having pure=false contradicts Julia's behavior, which assumes mapped functions are pure. For instance:

julia> using SparseArrays

julia> A = sparse([1, 1, 2, 3], [1, 3, 2, 3], [3, 1, 2, 3])
3×3 SparseMatrixCSC{Int64, Int64} with 4 stored entries:
 3  ⋅  1
 ⋅  2  ⋅
 ⋅  ⋅  3

julia> map(x -> x^2 + rand(), A)
3×3 SparseMatrixCSC{Float64, Int64} with 9 stored entries:
 9.54687  0.85208  1.86548
 0.85208  4.31839  0.85208
 0.85208  0.85208  9.26578

This default behavior is also inconsistent with most programmers' expectations, as map is a functional construct and therefore tends to assume side-effect free functions. This could lead to bugs; it also means that any code using map on a vector of unknown type can't take advantage of the performance enhancements provided by PooledArrays, since pure is not a keyword for the map method in base. As a result, I propose defaulting to pure=true.

Dec 05 '21 18:12 ParadaCarleton

This was not something we added because of an abstract thinking about a problem. It was the opposite - we changed it because users reported errors.

In particular your reasoning, while plausible is not consistent with map contract in Julia Base, which is:

Transform collection c by applying f to each element.

which clearly states that f must be applied to each element of c. If Julia Base changes this contract we can consider changing it here. Given this contract, I would say that your example from SparseArrays is a bug not a feature.

Additionally, I would expect most users of PooledArrays.jl to not be programmers but regular data scientists who are typically not aware of such subtleties unfortunately so we preferred to be on a safe side with the design.

Dec 05 '21 18:12 bkamins

It's worth noting that in addition to sparse arrays, FillArrays has been using the efficient/pure implementation for a long time now, and I'm definitely wary of a diverging standard from other, very commonly-used packages.

Feb 17 '22 16:02 ParadaCarleton

We could change this since in DataFrames.jl we are now safeguarded against this problem with other means.

@nalimilan - what do you think?

Feb 17 '22 16:02 bkamins

I've commented at https://github.com/JuliaSparse/SparseArrays.jl/issues/4#issuecomment-1043459644.

TBH I doubt the pure=false default could create any bugs, as doing map(x -> x^2 + rand(), A) and assuming purity is a really weird thing to do -- especially given that this isn't clearly documented. But I agree we should use the same assumptions in all packages once we can clearly document the expectations in Base.

Feb 17 '22 21:02 nalimilan

OK - let us wait for core devs and then decide.

Feb 17 '22 21:02 bkamins

I wonder whether the recent addition of various kinds of function purity (https://github.com/JuliaLang/julia/pull/43852) could be used here, so that by default we would treat the function as not pure (as currently), but for functions declared as pure (technically, I guess :consistent) we would only run them once for each value in the pool. Or maybe that's too technically advanced so that users won't take advantage of it.

Feb 24 '22 13:02 nalimilan

I think it makes sense, as I hope e.g. functions in Base and in most important packages can be annotated this way.

Feb 24 '22 15:02 bkamins

I've commented at JuliaSparse/SparseArrays.jl#4 (comment).

TBH I doubt the pure=false default could create any bugs, as doing map(x -> x^2 + rand(), A) and assuming purity is a really weird thing to do -- especially given that this isn't clearly documented. But I agree we should use the same assumptions in all packages once we can clearly document the expectations in Base.

I don't think it can cause any true bugs, but I first commented when I noticed a massive performance regression in a package of mine that made it essentially unusable with big PooledArrays.

I think it makes sense, as I hope e.g. functions in Base and in most important packages can be annotated this way.

That would be good.

Feb 25 '22 02:02 ParadaCarleton

I don't think it can cause any true bugs

We added this functionality because of the bugs reported by users.

However, I know that in many cases it causes a significant performance drop, so we need a good solution here.

Feb 25 '22 06:02 bkamins

I don't think it can cause any true bugs, but I first commented when I noticed a massive performance regression in a package of mine that made it essentially unusable with big PooledArrays.

Do you think using the purity declarations added by https://github.com/JuliaLang/julia/pull/43852) would be an option in that use case?

Mar 01 '22 13:03 nalimilan

I don't think it can cause any true bugs

We added this functionality because of the bugs reported by users.

However, I know that in many cases it causes a significant performance drop, so we need a good solution here.

It looks like it is, indeed, intended behavior for map to assume purity. In that case, the current behavior should be considered a bug, since someone looking for map(x->rand()) might want to make sure map is producing identical outputs for identical inputs -- e.g. if mapping each participant's name to a randomly-generated ID.

Apr 03 '22 01:04 ParadaCarleton

When https://github.com/JuliaLang/julia/pull/44233 in Base is decided can you please ping us here. Thank you!

Apr 03 '22 06:04 bkamins