Constructor to create a PooledArray sharing a pool with another PooledArray
In the case where you have 2+ PooledArrays and you want them to share a single pool, it would be useful to have a constructor that's something like
foo = PooledArray(rand(["a","b","c"], 5000))
bar = PooledArray(rand(["a","b","c"], 100), foo)
where bar is a PooledArray of 100 elements that share the pool of foo. Therefore, if one was to change foo.pool[1] = "zebra", every occurrence of "a" in bar would become "zebra"
The constructor would look something like
PooledArray(x, y::PooledArray)
I'm not sure of what a clean and efficient process for that would look like. Perhaps since the pool already exists, replacing every occurrence of a unique element with the corresponding pool index?
if one was to change foo.pool[1] = "zebra", every occurrence of "a" in bar would become "zebra"
This should never be done. pool is a private field of PooledArray. What you propose would corrupt the PooledArray.
The constructor would look something like
The signature is up for discussion; the implementation should be something like the following (probably something more efficient, but I am showing that we already almost have it):
res = y[fill.(1, size(x))...]
res .= x
and now res shares its pool with y, assuming that the elements of x are a subset of the pool of y.
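A minimal sketch of the idea above for the one-dimensional case, using small fixed vectors instead of `rand` so the outcome is deterministic. It assumes PooledArrays 1.x, where indexing with a vector returns an array that shares the parent's pool. The `pool` field is read here only to observe the sharing; as noted above it is private, so this check is for illustration and not something to rely on.

```julia
using PooledArrays

foo = PooledArray(["a", "b", "c", "a"])
x = ["c", "a", "a"]              # every value is already in foo's pool

# Indexing foo yields a new PooledArray that reuses foo's pool.
res = foo[fill(1, length(x))]

# Overwrite the placeholder values with the ones we actually want.
res .= x

# Reading the private field only to inspect the result (illustration only).
shares_pool = res.pool === foo.pool
```

If `x` contained a value missing from the pool, assigning it would add a new level to `res`, and the two arrays would no longer share a pool.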
Forgive my ignorance, but renaming a pool entry would corrupt the PooledArray? What would be the idiomatic and safe way to change all occurrences of a single pool item?
1. you check if a new value of the pool item is already present in the pool or not.
2. if it is present - then there is no way to make a fast change; you need to do a full table scan and update
3. if it is not present then assuming you are sure that no other thread is mutating PooledArray you can:
   - set pool the way you proposed
   - remove from invpool the mapping from the old value that you removed
   - add to invpool a mapping to the new value that you have changed to
Note, though, that the points described in step 3 are implementation details that might change in the future.
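For completeness, the full-scan route can be written with Base functions alone, without touching any private field. The sketch below uses the generic `replace!` from Base, which visits every element and goes through the array's normal `setindex!`. It is a slower but safe alternative, not the fast path described in step 3.

```julia
using PooledArrays

pa = PooledArray(["a", "b", "a", "c"])

# Generic element-wise replacement: scans the whole array and lets
# PooledArray maintain its own internal bookkeeping.
replace!(pa, "a" => "zebra")
```

Since levels are only ever added, the old value may remain in the pool after this call even though no element refers to it any more.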
We do not expose this as an official API, as we do copy-on-write of pool, which means that several arrays can share the same pool so exposing such functionality would be very error prone.
Could you describe your use case? It would make sense to provide a constructor to share pools between arrays, but that would only have an effect on performance. What you seem to be asking for is a way to synchronize arrays (a kind of "spooky action at a distance"? :-D).
In PopGen.jl the main data struct is
struct PopData
metadata::DataFrame
genodata::DataFrame
end
The metadata df has sample info (name, ploidy, population, geo coords) in wide format and the genodata is a long-format table of (name, population, locus, genotype). In the long-format table, all columns except genotype are PooledArrays. My hope was to have the two DataFrames linked such that mutating operations on one would be reflected in the other. For example, renaming a sample in the metadata would also rename it in the genodata, etc. Spooky synchronization is exactly what I had in mind!
I see. That sounds like a legitimate use case for this, but it's tricky in particular because of thread safety issues. @bkamins worked hard to find a way to implement copy-on-write of the pools, ensuring that while two arrays can share their pools (e.g. with y = x[1:6]), as soon as an attempt is made to modify the pool it is copied so that the other array isn't affected. Otherwise it would be impossible to write thread-safe code, since incorrect results could appear if one pool is mutated while reading entries in the other array from another thread. Even though when you create such arrays you are aware of the fact that they share their pools, you cannot be sure of what will happen to them if they are passed to package code, if somebody refactors the code later, if users extract arrays from PopData structs without being aware of this, etc.
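The copy-on-write behaviour described here can be observed with a short sketch. It assumes PooledArrays 1.x and reads the private `pool` field purely for inspection; the exact point at which the copy happens is an implementation detail.

```julia
using PooledArrays

x = PooledArray(["a", "b", "c", "a", "b", "c", "a"])
y = x[1:6]                        # y starts out sharing x's pool

shared_before = x.pool === y.pool

# Assigning a value that is not in the pool forces y to take its own copy.
y[1] = "zebra"

shared_after = x.pool === y.pool
```

After the assignment `x` is unchanged, which is exactly why renaming through a shared pool cannot be used to synchronize two arrays.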
This wouldn't be a problem if we had a way to make pools thread-safe, but that appears to be impossible to do without completely killing the performance of getindex. And of course that's non-negotiable...
That said, if you really don't care about thread safety, it should be relatively easy to allow creating two arrays that totally share their pools. But then your users may get trapped by this if they mutate the columns of the data frames. Or maybe they are not supposed to do it at all?
That said, if you really don't care about thread safety, it should be relatively easy to allow creating two arrays that totally share their pools.
It is already possible - just use an inner constructor explicitly.
But then your users may get trapped by this if they mutate the columns of the data frames.
This is not that bad, as currently we do not allow removing levels from the pool - you can only add levels, so essentially only thread safety is an issue. However, this might change in the future, see https://github.com/JuliaData/DataAPI.jl/pull/31.