TileDB icon indicating copy to clipboard operation
TileDB copied to clipboard

perf: [sc-60446] Enumerations loaded from schema execute `generate_value_map` in a background task

Open rroelke opened this issue 1 year ago • 1 comments

Story details: https://app.shortcut.com/tiledb-inc/story/60446

In the story we have observed that generate_value_map can take a non-trivial amount of time to execute. This function builds a map from enumeration variant to key, so that we can detect duplicates and do fast lookups.

However, generate_value_map is always called when we construct an Enumeration. However:

  1. barring a data corruption, an Enumeration which is constructed as we load it from an ArraySchema was already validated when it was either created or extended!
  2. lookup of the key for a variant is an uncommon operation, merged in #5482 which has not appeared in a release as of this writing.

Hence all users of Enumeration are paying the potentially expensive cost of an operation whose result is never used.

This pull request adds an option to migrate the execution of generate_value_map to a background task. If the bool async flag is true then generate_value_map will be called from a thread pool task, and Enumeration::value_map() will wait for that computation to finish. We construct Enumerations with async = true when loading them for an ArraySchema and async = false when they are created by the user (in which case we want the duplicate validation to happen immediately).

If an exception is thrown due to a corrupted Enumeration for some reason, then we will only see it upon calling value_map() in which case the exception thrown while executing the future will be thrown. This is tested in the added test case Enumeration Creation Error - Async.

This requires threading a ContextResources through all of the call stacks which find their way to Enumeration::create. This represents most of the changes in this PR, which are largely encapsulated in the single commit 28f97ce, so I recommend reviewing commit by commit.

I have also added timers to measure time spent in generate_value_map and time spent waiting for the it to finish.


TYPE: IMPROVEMENT DESC: async Enumeration::generate_value_map

rroelke avatar Apr 15 '25 21:04 rroelke

We can generate value_map concurrently with other steps of the loading process if we want, but we must make sure all work finishes before the C API returns to the user. Otherwise we get an illusion of performance improvement that will likely make things worse in highly concurrent environments like the server.

teo-tsirpanis avatar Apr 16 '25 10:04 teo-tsirpanis