`pure=true` default to align with Base
I see why the pure keyword was added, but having pure=false contradicts Julia's behavior, which assumes mapped functions are pure. For instance:
julia> using SparseArrays
julia> A = sparse([1, 1, 2, 3], [1, 3, 2, 3], [3, 1, 2, 3])
3×3 SparseMatrixCSC{Int64, Int64} with 4 stored entries:
3 ⋅ 1
⋅ 2 ⋅
⋅ ⋅ 3
julia> map(x -> x^2 + rand(), A)
3×3 SparseMatrixCSC{Float64, Int64} with 9 stored entries:
9.54687 0.85208 1.86548
0.85208 4.31839 0.85208
0.85208 0.85208 9.26578
This default behavior is also inconsistent with most programmers' expectations, as map is a functional construct and therefore tends to assume side-effect free functions. This could lead to bugs; it also means that any code using map on a vector of unknown type can't take advantage of the performance enhancements provided by PooledArrays, since pure is not a keyword for the map method in base. As a result, I propose defaulting to pure=true.
This was not something we added because of an abstract thinking about a problem. It was the opposite - we changed it because users reported errors.
In particular your reasoning, while plausible is not consistent with map contract in Julia Base, which is:
Transform collection c by applying f to each element.
which clearly states that f must be applied to each element of c. If Julia Base changes this contract we can consider changing it here. Given this contract, I would say that your example from SparseArrays is a bug not a feature.
Additionally, I would expect most users of PooledArrays.jl to not be programmers but regular data scientists who are typically not aware of such subtleties unfortunately so we preferred to be on a safe side with the design.
It's worth noting that in addition to sparse arrays, FillArrays has been using the efficient/pure implementation for a long time now, and I'm definitely wary of a diverging standard from other, very commonly-used packages.
We could change this since in DataFrames.jl we are now safeguarded against this problem with other means.
@nalimilan - what do you think?
I've commented at https://github.com/JuliaSparse/SparseArrays.jl/issues/4#issuecomment-1043459644.
TBH I doubt the pure=false default could create any bugs, as doing map(x -> x^2 + rand(), A) and assuming purity is a really weird thing to do -- especially given that this isn't clearly documented. But I agree we should use the same assumptions in all packages once we can clearly document the expectations in Base.
OK - let us wait for core devs and then decide.
I wonder whether the recent addition of various kinds of function purity (https://github.com/JuliaLang/julia/pull/43852) could be used here, so that by default we would treat the function as not pure (as currently), but for functions declared as pure (technically, I guess :consistent) we would only run them once for each value in the pool. Or maybe that's too technically advanced so that users won't take advantage of it.
I think it makes sense, as I hope e.g. functions in Base and in most important packages can be annotated this way.
I've commented at JuliaSparse/SparseArrays.jl#4 (comment).
TBH I doubt the
pure=falsedefault could create any bugs, as doingmap(x -> x^2 + rand(), A)and assuming purity is a really weird thing to do -- especially given that this isn't clearly documented. But I agree we should use the same assumptions in all packages once we can clearly document the expectations in Base.
I don't think it can cause any true bugs, but I first commented when I noticed a massive performance regression in a package of mine that made it essentially unusable with big PooledArrays.
I think it makes sense, as I hope e.g. functions in Base and in most important packages can be annotated this way.
That would be good.
I don't think it can cause any true bugs
We added this functionality because of the bugs reported by users.
However, I know that in many cases it causes a significant performance drop, so we need a good solution here.
I don't think it can cause any true bugs, but I first commented when I noticed a massive performance regression in a package of mine that made it essentially unusable with big
PooledArrays.
Do you think using the purity declarations added by https://github.com/JuliaLang/julia/pull/43852) would be an option in that use case?
I don't think it can cause any true bugs
We added this functionality because of the bugs reported by users.
However, I know that in many cases it causes a significant performance drop, so we need a good solution here.
It looks like it is, indeed, intended behavior for map to assume purity. In that case, the current behavior should be considered a bug, since someone looking for map(x->rand()) might want to make sure map is producing identical outputs for identical inputs -- e.g. if mapping each participant's name to a randomly-generated ID.
When https://github.com/JuliaLang/julia/pull/44233 in Base is decided can you please ping us here. Thank you!