PooledArrays.jl constuctor to create PooledArray sharing a pool from another PooledArray

In the case where you have 2+ PooledArrays and you want them to share a single pool, it would be useful to have a constructor that's something like


foo = PooledArray(rand(["a","b","c"], 5000))

bar = PooledArray(rand(["a","b","c"], 100), foo)

where bar is a PooledArray of 100 elements that share the pool of foo. Therefore, if one was to change foo.pool[1] = "zebra", every occurrence of "a" in bar would become "zebra"

The constructor would look something like

PooledArray(x, y::PooledArray)

I'm not sure of what a clean and efficient process for that would look like. Perhaps since the pool already exists, replacing every occurrence of a unique element with the corresponding pool index?

Aug 03 '21 15:08 pdimens

if one was to change foo.pool[1] = "zebra", every occurrence of "a" in bar would become "zebra"

This should never be done. pool is a private field of PooledArray. What you propose would corrupt PooledArray.

The constructor would look something like

The signature is up to the discussion, the implementation should be something like (probably something more efficient but I am showing that we already almost have it):

res = y[fill.(1, size(x))...]
res .= x

and now res shares pool with y assuming that elements of x were a subset of pool of y.

Aug 03 '21 15:08 bkamins

Forgive my ignorance, but renaming a pool entry would corrupt the PooledArray? What would be the idiomatic and safe way to change all occurrences of a single pool item?

Aug 03 '21 18:08 pdimens

you check if a new value of the pool item is already present in the pool or not.
if it is present - then there is no way to make a fast change; you need to do a full table scan and update
if it is not present then assuming you are sure that no other thread is mutating PooledArray you can:
- set pool the way you proposed
- remove from invpool the mapping from the old value that you removed
- add to invpool a mapping to the new value that you have changed to

Note though that points described in step 3 are implementation detail that might change in the future.

We do not expose this as an official API, as we do copy-on-write of pool, which means that several arrays can share the same pool so exposing such functionality would be very error prone.

Aug 03 '21 22:08 bkamins

Could you describe your use case? It would make sense to provide a constructor to share pools between arrays, but that would only have an effect on performance. What you seem to be asking for is a way to synchronize arrays (a kind of "spooky action at a distance"? :-D).

Sep 04 '21 13:09 nalimilan

In PopGen.jl the main data struct is

struct PopData
    metadata::DataFrame
    genodata::DataFrame
end

The metadata df has sample info (name, ploidy, population, geo coords) in wide format and the genodata is long format of (name, population,locus, genotype). In the long format table, all columns except genotype are PooledArray. My hope was to have the two DataFrames linked such that mutating operations on one would be reflected on the other. For example, renaming a sample in the metadata would also rename it in the genodata, etc. Spooky synchronization is exactly what I had in mind!

Sep 04 '21 14:09 pdimens

I see. That sounds like a legitimate use case for this, but it's tricky in particular because of thread safety issues. @bkamins worked hard to find a way to implement copy-on-write of the pools, ensuring that while two arrays can share their pools (e.g. with y = x[1:6]), as soon as an attempt is made to modify the pool it is copied so that the other array isn't affected. Otherwise it would be impossible to write thread-safe code, since incorrect results could appear if one pool is mutated while reading entries in the other array from another thread. Even though when you create such arrays you are aware of the fact that they share their pools, you cannot be sure of what will happen to them if they are passed to package code, if somebody refactors the code later, if users extract arrays from PopData structs without being aware of this, etc.

This wouldn't be a problem if we had a way to make pools thread-safe, but that appears to be impossible to do without completely killing performance of getindex. And of course that's non-negociable...

That said, if you really don't care about thread safety, it should be relatively easy to allow creating two arrays that totally share their pools. But then your users may get trapped by this if they mutate the columns of the data frames. Or maybe they are not supposed to do it at all?

Sep 04 '21 20:09 nalimilan

That said, if you really don't care about thread safety, it should be relatively easy to allow creating two arrays that totally share their pools.

It is already possible - just use an inner constructor explicitly.

But then your users may get trapped by this if they mutate the columns of the data frames.

This is not that bad, as currently we do not allow removing levels from pool - you can only add levels, so essentially only thread safety is an issue. However, this might change in the future, see https://github.com/JuliaData/DataAPI.jl/pull/31.

Sep 04 '21 21:09 bkamins