Add support for the new metadata version functionality to the directory layer

Open ajbeamon opened this issue 6 years ago • 8 comments

The new metadata version functionality introduced in 6.1 could be helpful in the directory layer for determining whether a cached open directory has been deleted without needing to perform extra reads in every transaction. It should be possible for clients to do this themselves (i.e. update the metadata version when deleting directories and checking it when using a cached directory). However, it would be convenient if this could be provided by the layer itself with little or no client effort.

It may not be desirable to use the metadata version unconditionally, since some users of the directory layer may not need that functionality and instead want to use the metadata version for other purposes. It's also possible that a client may not want to use this functionality for all directories. For example, this may be the case if some directories are much less stable than others and aren't cached like the stable directories. We'll need to determine what level of control we want to provide and then decide on a reasonable API for it.
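For illustration, here is a rough sketch (using the Python binding) of the "clients do it themselves" approach: bump the global `\xff/metadataVersion` key in the same transaction that removes a directory, and compare the value seen at open time before trusting a cached subspace. The 14-byte zero value passed to `set_versionstamped_value` is my understanding of the format this key requires, and the in-process dict is purely illustrative, not a proposed API.

```python
import fdb

fdb.api_version(610)
db = fdb.open()

METADATA_VERSION_KEY = b'\xff/metadataVersion'

@fdb.transactional
def remove_directory(tr, path):
    # Delete the directory and bump the global metadata version in the same
    # transaction, so cached copies of this directory can be invalidated.
    fdb.directory.remove(tr, path)
    # 14 zero bytes: a 10-byte versionstamp placeholder plus a 4-byte offset of
    # 0, which is (as far as I know) the value format this key requires.
    tr.set_versionstamped_value(METADATA_VERSION_KEY, b'\x00' * 14)

# Illustrative in-process cache: path -> (metadata version at open time, subspace).
_cached = {}

@fdb.transactional
def open_cached(tr, path):
    # The metadata version travels with the read version, so this read is cheap.
    current = tr[METADATA_VERSION_KEY].wait()
    entry = _cached.get(path)
    if entry is not None and entry[0] == current:
        return entry[1]                    # still valid as of this read version
    subspace = fdb.directory.create_or_open(tr, path)
    _cached[path] = (current, subspace)
    return subspace
```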

ajbeamon avatar Apr 03 '19 23:04 ajbeamon

Good point that this integration is actually possible for clients to implement themselves.

kocolosk avatar Apr 04 '19 15:04 kocolosk

I've been looking at adding a directory subspace cache that would make use of the new metadata version key, and I was wondering what changes - if any - would be needed for the current Directory Layer.

I think that it would be easy to add a cache on top of the current layer, at the application layer, that would cache instances of directory subspaces, using the metadata version key as the check. My plan is to start an async read request on that key while the rest of the transaction handler is running (with potentially stale prefixes), and once it completes (before or at commit time), throw a retryable exception if the value has changed, so that the transaction handler starts again.

Looking at the directory layer itself, I think that the only operations that would absolutely require touching the metadata version key are deletions and renames/moves of subspaces. I'm not sure if this would be required for the creation of new subspaces: the only case where that would matter is if the cache also stored the absence of a subspace, or the result of listing all the children of a subspace. I'm not sure these two methods are very frequent; maybe they could simply not be served from the cache?

So to recap, my current thinking is to change the Directory Layer to only touch the metadata version key for changes/deletions, not for the creation of new subspaces. Some methods, like checking for existence or listing sub-directories, would not be served from the cache.

The rest of the cache implementation would be at the application layer (for now) and would hook into the retry loop handler so that it has the opportunity to check the metadata version key before commit (or at the end of the retry loop for read-only transactions) and throw a retryable error (wiping the cache in the process).
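Concretely, that retry-loop hook could look something like this rough sketch (Python binding, for consistency with the thread; `DirectoryCache`, `start_check` and `finish_check` are just placeholder names, and raising not_committed is one way to make the standard retry loop run the handler again):

```python
import fdb

fdb.api_version(610)

METADATA_VERSION_KEY = b'\xff/metadataVersion'

class DirectoryCache(object):
    """Hypothetical application-level cache of directory subspaces."""
    def __init__(self):
        self.version = None       # metadata version the cache was built against
        self.subspaces = {}       # path tuple -> DirectorySubspace

    def start_check(self, tr):
        # Kick off the read; the value comes back with the read version, so
        # this does not cost an extra round trip and can run alongside the handler.
        return tr[METADATA_VERSION_KEY]

    def finish_check(self, future):
        current = future.wait()
        if self.version == current:
            return                              # cached prefixes are still valid
        stale = self.version is not None
        self.version = current
        self.subspaces.clear()                  # wipe the cache...
        if stale:
            # ...and force the handler to run again with fresh prefixes;
            # not_committed (1020) is retryable, so @fdb.transactional retries.
            raise fdb.FDBError(1020)

cache = DirectoryCache()

@fdb.transactional
def do_work(tr):
    check = cache.start_check(tr)
    # ... handler body runs here, possibly using stale cached subspaces ...
    cache.finish_check(check)   # before commit: retry if the cache was stale
```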

Any thoughts on this?

KrzysFR avatar May 27 '19 07:05 KrzysFR

My plan is to start an async read request on that key while the rest of the transaction handler is running (with potentially stale prefixes), and once it completes (before or at commit time), throw a retryable exception if the value has changed, so that the transaction handler starts again.

I'm not sure if it would make things simpler for you, but I believe the metadata version key is fetched along with the read version. It would probably be reasonable to call this first thing in your transaction and block on it, given that every other operation would also be blocked on the read version. If it doesn't make any difference whether you know the metadata version before starting, then it's probably a very modest performance benefit to run it concurrently.

Looking at the directory layer itself, I think that the only operations that would absolutely require touching the metadata version key are deletions and renames/moves of subspaces. I'm not sure if this would be required for the creation of new subspaces: the only case where that would matter is if the cache also stored the absence of a subspace, or the result of listing all the children of a subspace. I'm not sure these two methods are very frequent; maybe they could simply not be served from the cache?

This seems reasonable to me. If your only concern is that somebody doesn't use a deleted directory to pack keys, then the only thing you have to monitor is deletions. If you are using cached directories to interact with the directory hierarchy (listing, creating subdirectories, etc. using a cached directory), then you would want to monitor moves as well. Both of those seem like things you'd likely want to handle in something general purpose. If you want to actually traverse the hierarchy in the cache, then you also need to monitor creations. That seems less necessary in general, though possibly useful in some situations.

The only other thing I'd add is that if you plan to make this part of the directory layer itself, you should give some consideration to the comments I made originally regarding the flexibility of this mechanism. Not everybody will want to pollute the metadata version with this, and it may also be the case that they won't want to do it for all directories (for example, if some directories change frequently while others are fairly static). Also, if you are planning to make this part of the directory layer and are interested in the API behaving similarly to that of other bindings, we should work together to hash out the high-level details.

ajbeamon avatar May 28 '19 15:05 ajbeamon

Has anyone tackled this problem?

I've started actively hacking at my directory layer implementation to see how I could retrofit a cache without changing the API too much, and my initial conclusion is that it's not easy to do, given all the possible combinations of mutations that can happen inside a single transaction (creating a subspace, getting a cached version, deleting/moving the same subspace, creating a new one with the previous name, ...).

I don't think it would be possible to write a multi-threaded cache context that handles all the possible combinations of operations inside a transaction, let alone one that works with multiple concurrent transactions. I took a look at the implementation of MetaDataVersionStampStoreStateCache in the Record Layer, but I'm not sure that approach can be adapted to the Directory Layer at all.

I then tried adding a separate TryOpenCached(...) API with the contract that "you should only call this from transactions that DO NOT mutate the directory layer in any way", but it looks a bit clunky: when should you call TryOpen(...) vs TryOpenCached(...)? What other operations in the same transaction become unsafe if you use this? How could you use two layers from the same transaction if they conflict on that rule?

Side note: one specific case that could easily benefit from caching inside the Directory Layer is caching the "version" key per DL, which is checked before every read or write. Since this is not exposed to the outside, I think it could be used to reduce the number of requests by one.

Another approach was to not deal with caching explicitly in the Directory Layer's API, but to let the caller do the caching. For that to be very efficient, though, I would also need to add a "metadataVersion" attribute alongside the "layer" key for each subspace, which requires a breaking change in the Directory Layer (v1.1?)

Also, since the Directory Layer supports partitions, the cache of one partition must also be aware of any potential change in a parent partition (e.g. moving the folder in the parent partition), all the way up to the root partition.

So to "revalidate" a cached directory subspace entry, one would need to load all the "metadataVersion" attributes of all the folders in the path up to the root, which seems a bit complex, and still require the same breaking change ("v1.1")

So, my final question is: does anyone think that the DL "v1.0" can be cached at all, or do we need a new DL 2.0 built from scratch to support caching? Or, with redwood and prefix compression, will the DL itself even be needed anymore?

KrzysFR avatar Dec 09 '19 15:12 KrzysFR

I'm not aware of anyone having done work on this yet, but I can add a few thoughts.

Managing a cache in the face of simultaneous directory modifications in the same transaction is an interesting concern, though I'll note that the directory layer already has other problems in this regard and it may be unsafe to make these modifications anyway: https://github.com/apple/foundationdb/issues/895. It's probably interesting to think about, but this fact may give us more leeway to change the implementation.

If we ignore the multi-threaded mutation aspect within a single transaction, you may still need some way to understand the transaction-local changes to the cache for subsequent reads. I suppose for this you could store a cache overlay that represents the local changes and that somehow travels with a transaction, though I'm not sure exactly what the API for that should look like. A simpler but less performant approach may be to track whether any relevant modifications have been made within the current transaction and then afterward ignore the cache entirely. It seems reasonable to me in either case that the contract for this would be that reads happening simultaneously with a mutation to the directory layer could use old cached directories or the updated ones. If there's a risk of seeing some inconsistency (e.g. due to needing to read multiple cached entries), then some external synchronization would need to be used.
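For example, the "track and bypass" approach could be as small as this sketch (Python binding; `note_directory_mutation` would be called by whatever wraps DL mutations, and `cache.get_or_open` is a hypothetical cache helper):

```python
import weakref
import fdb

fdb.api_version(610)

# Transactions in which the directory layer has been mutated; reads in those
# transactions bypass the cache for the remainder of the transaction. (Note:
# the retry decorator reuses the transaction object, so a retried transaction
# stays tagged, which only errs on the safe side.)
_mutated = weakref.WeakSet()

@fdb.transactional
def remove_directory(tr, path):
    fdb.directory.remove(tr, path)
    _mutated.add(tr)

@fdb.transactional
def open_directory(tr, path, cache):
    if tr in _mutated:
        # Transaction-local changes may have invalidated cached prefixes:
        # read through the directory layer instead of trusting the cache.
        return fdb.directory.create_or_open(tr, path)
    return cache.get_or_open(tr, path)   # hypothetical cache helper
```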

For multiple concurrent transactions, I think my initial thought was that a transaction operates using the cache if the metadata version hasn't changed, relying on the read conflict range on the metadata version key. If another transaction changes something and updates the metadata version key, your transaction will fail and would then need to validate the cached entries as needed when retrying (either by re-reading them or through some other mechanism like hierarchical metadata versions, etc.). Cache updates would probably need to happen after the metadata version has changed and you've decided that a cache entry is no longer valid.
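Roughly, and assuming (as I understand it) that a plain non-snapshot read of the key adds the usual read conflict range, that pattern could look like this sketch, where `cache` is a hypothetical object holding the version it was built against and a path-to-subspace map:

```python
import fdb

fdb.api_version(610)

METADATA_VERSION_KEY = b'\xff/metadataVersion'

@fdb.transactional
def open_with_cache(tr, cache, path):
    # A non-snapshot read of the key should add a read conflict range, so if a
    # concurrent transaction bumps the metadata version before we commit, this
    # transaction conflicts and the retry loop runs again with a fresh cache.
    current = tr[METADATA_VERSION_KEY].wait()

    if cache.version != current:
        # Something changed since the cache was built: drop the entries and
        # re-validate lazily as directories are re-opened.
        cache.subspaces.clear()
        cache.version = current

    subspace = cache.subspaces.get(path)
    if subspace is None:
        subspace = fdb.directory.create_or_open(tr, path)
        cache.subspaces[path] = subspace
    return subspace
```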

If you have concrete examples of things that are difficult to implement currently, I'd be interested to see those, as I'm sure they would be helpful to anybody that wants to pick up this work. I'll also loop in @alecgrieser as I imagine he has experience with this in the record layer.

ajbeamon avatar Dec 09 '19 19:12 ajbeamon

If you have concrete examples of things that are difficult to implement currently, I'd be interested to see those, as I'm sure they would be helpful to anybody that wants to pick up this work.

Well, basically all of what you explained makes this not really worth it, imho: the implementation has to handle so many edge cases, and I don't think I would be able to validate it as 100% safe, which is what I want in the first place: a strong base on which to build caching in the other layers relying on the DL.

For the moment, I'm only exposing a single method TryOpenCached that uses a dedicated cache of absolute paths to directory subspace instances, based on the global \xff/metadataVersion key. The cache currently discards everything whenever the value changes globally, which is not really efficient if every other layer is bumping the version frequently... I also added a mutual exclusion check between methods working from the cache and methods doing mutations in the tree (by tagging each transaction), because I'm afraid of them mixing badly. Mixing TryOpen / TryOpenCached is ok since they don't interfere, but they could return different results!

My next goal would be to have a way to not throw out the entire cache whenever \xFF/metadataVersion changes, but right now there are no keys in the DL subspace that can be used for a quick check (hence why I'm currently throwing everything out). The easiest solution would be to add a new metadataVersion key in addition to the currently existing version key, which would be a versionstamp updated on every mutation (create, delete, move, rename, layer id change ...), but the implementation would still need to re-read these version keys for all the directory partitions that have been traversed (we use partitions a lot!). This would at least limit cache invalidation to changes in the DL itself, instead of changes from any other layer in the entire DB. Not great either, but we don't currently have a layer that frequently mutates the DL, except during install or upgrade of the application.
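That two-level check might look roughly like the following sketch, where `dl_version_key` stands in for the hypothetical DL-scoped versionstamped key described above (not something the current Directory Layer maintains) and `cache` is a simple holder object:

```python
import fdb

fdb.api_version(610)

METADATA_VERSION_KEY = b'\xff/metadataVersion'

@fdb.transactional
def revalidate_cache(tr, cache, dl_version_key):
    # Cheap first-level check: if the global metadata version is unchanged,
    # nothing in the whole database mutated any schema, including the DL.
    global_version = tr[METADATA_VERSION_KEY].wait()
    if cache.global_version == global_version:
        return True

    # Some layer bumped the global version; check whether the DL itself changed
    # by re-reading its (hypothetical) own versionstamped key.
    dl_version = tr[dl_version_key].wait()
    cache.global_version = global_version
    if cache.dl_version == dl_version:
        return True                      # the change came from another layer

    cache.dl_version = dl_version
    cache.subspaces.clear()              # the DL changed: throw the cache out
    return False
```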

If I want to re-validate each cached subspace individually, I need a version key per directory, and I also need to revalidate the entire chain up to the root. It could be done by issuing all the gets in parallel, but that means each directory subspace instance has to keep a reference to all its parents, across partitions. The current code does not exactly track that... I left tonight in the middle of that refactoring, but was still unsure if this is the proper approach anyway, because it still requires a breaking change to the Directory Layer, which would probably need to be coordinated across all bindings...

Lastly, even with a very simplistic cache like this, the CPU overhead of the cache starts to become non-trivial... And if the cluster sees a lot of metadataVersion bumps, I'm afraid that the cost of re-validating the cache by reading a hierarchy of version keys would be the same or even higher than simply re-reading everything on every transaction.

On the other hand, having a newer and more modern implementation of the Directory Layer, where the lookup of a directory with a complex path would be done in a single query (instead of going down from the root with sequential queries, as today), would probably greatly reduce the importance of caching...

How to deal with the metadataVersion key across multiple layers?

One specific issue I'm unsure how to solve: I have to read the metadataVersion key to validate my cache context. If I read the key after some other layer has bumped it (in the same transaction), it will fail (cannot read the key that was modified with a versionstamped operation). If I read it in snapshot isolation then it will work... but I will get the initial value and won't know that it has been changed locally... I could maybe determine that the DL itself is not the one that changed it (already tracking that per transaction) so that the cache would only need to know the value at the start of the transaction...

But then what about a second layer wanting to do caching, but running after the Directory Layer changed something? It wouldn't be able to read the "new" metadataVersion either and wouldn't be able to check its own cache... or maybe every layer is supposed to only look at the snapshot value, and always avoid doing cached reads and schema mutations in the same transaction?
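To make the failure mode concrete, here is a small sketch of the behavior described above: after a versionstamped write to the key in the same transaction, a plain read fails, while a snapshot read still works but only returns the value as of the read version:

```python
import fdb

fdb.api_version(610)
db = fdb.open()

METADATA_VERSION_KEY = b'\xff/metadataVersion'

@fdb.transactional
def bump_then_read(tr):
    # Bump the metadata version with a versionstamped value.
    tr.set_versionstamped_value(METADATA_VERSION_KEY, b'\x00' * 14)

    # A snapshot read still works, but returns the value as of the read
    # version, i.e. the state *before* the local bump.
    before = tr.snapshot[METADATA_VERSION_KEY].wait()

    # A plain read now fails: the new value depends on the commit versionstamp,
    # which is not known until the transaction commits.
    try:
        tr[METADATA_VERSION_KEY].wait()
    except fdb.FDBError as e:
        print('plain read after the bump failed with error code', e.code)

    return before
```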

What is a directory subspace, anyway?

Another maybe less important question but where I don't have the answer:

  • Is a "directory subspace" considered to be fixed path "/foo/bar/baz" that designate a logical place in the keyspace, and for "logistic" reason is mapped to a key prefix that can change from time to time (delete / re-create) until we get redwood/prefix compression?
  • Or the reverse, where a "directory subspace" is a constant and small - but opaque - key prefix for a physical place in the keyspace, with a easier path (that can change from time to time) tacked onto it to make our live better... until we get redwood/prefix compression?
  • And once we get redwood/prefix compression, why can't I just use ("Foo", "Bar", "Baz",) as a prefix to all my keys and forget about the directory layer ? :)

Simply put: is a directory subspace the concept of "/Foo/Bar/Baz"? or the concept of every key that starts with 0x42? or a concept that will be soon rendered obsolete?

KrzysFR avatar Dec 09 '19 22:12 KrzysFR

The easiest solution would be to add a new metadataVersion key in addition to the currently existing version key, which would be a versionstamp updated on every mutation (create, delete, move, rename, layer id change ...), but the implementation would still need to re-read these version keys for all the directory partitions that have been traversed (we use partitions a lot!).

I suspect some of the trade-offs are going to depend on your use case. If you are expecting an infrequently modified directory layer, then this single key approach seems pretty reasonable to avoid being too affected by other uses. You could imagine not having one of these keys in each directory partition and only setting it at the global root. From that perspective, adding one to each directory partition seems like a potential optimization to further limit the scope of any modification. Applying one to every directory is likewise a similar optimization. I'm not sure to what degree it's worth setting and checking versions at different levels -- there's probably benefit to some degree, but maybe not everywhere?

One specific issue I'm unsure how to solve: I have to read the metadataVersion key to validate my cache context. If I read the key after some other layer has bumped it (in the same transaction), it will fail (cannot read the key that was modified with a versionstamped operation). If I read it in snapshot isolation then it will work... but I will get the initial value and won't know that it has been changed locally... I could maybe determine that the DL itself is not the one that changed it (already tracking that per transaction) so that the cache would only need to know the value at the start of the transaction...

I was thinking about this too in my less-performant suggestion that the cache is invalidated within a transaction if it's been modified. It made me wonder if there was some value in having a portion of key-space that a transaction could write into but that wouldn't be committed, such that users of a single transaction could coordinate on some state by writing and reading it. Concurrency guarantees would be pretty limited, but maybe it would be useful.

But then what about a second layer wanting to do caching, but running after the Directory Layer changed something? It wouldn't be able to read the "new" metadataVersion either and wouldn't be able to check its own cache... or maybe every layer is supposed to only look at the snapshot value, and always avoid doing cached reads and schema mutations in the same transaction?

If two layers are operating independent caches, they could probably be made to work by doing what you described for the DL case -- checking the starting metadata version and some extra per-layer state to indicate whether the layer has changed within a transaction. In this case it wouldn't matter if the one layer changed something in its purview because it shouldn't affect the other.

What is a directory subspace, anyway?

I tend to think of a directory subspace as equally a directory and a subspace. The subspace aspect of it means that it represents some prefix applied to all keys packed within it. The directory part indicates that it lives within this hierarchical path indirection scheme.

Both aspects are useful. Having a common prefix means that the keys will be located contiguously. The directory indirection allows you to both move directories within your hierarchy (not possible if the prefix is the path unless you also move contents) and be more expressive with your path names. It's this last point that is largely made moot by redwood, though there still exist places where these keys aren't compressed that might need to be addressed too, such as when sending keys and values over the wire in response to a range read.

ajbeamon avatar Dec 09 '19 23:12 ajbeamon

I guess the simplest change that could be done to get a semi-efficient cache would be to add a versionstamped key to each directory partition that would be updated on every mutation within that partition.

Each partition could then have a cache that, whenever the global metadataVersion changes, would have to re-get the value of its own key and those of all its parent partitions (up to the root) to decide whether to keep or throw out the cache.

Any change made in a partition would throw out the cache for that partition, and if you are using partitions as a mechanism for multi-tenancy, this should keep a noisy tenant from wiping the caches of the other partitions, with the only cost being a multi-get of a few keys. In practice this would probably involve concurrently reading a couple of keys (only one for people who don't use partitions).
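A rough sketch of that revalidation step, where `version_keys` is the hypothetical list of per-partition versionstamped keys on the path from the root down to this partition (again, not something the current DL maintains) and `cache` is a simple holder:

```python
import fdb

fdb.api_version(610)

@fdb.transactional
def revalidate_partition_cache(tr, cache, version_keys):
    # `version_keys`: one (hypothetical) versionstamped key per partition on
    # the path from the root down to this partition, in order.
    # Issue all the gets up front so they run in parallel...
    futures = [tr[k] for k in version_keys]
    # ...then wait for all of them and compare against the cached values.
    current = tuple(f.wait() for f in futures)

    if cache.partition_versions == current:
        return True                      # nothing on the path changed

    cache.partition_versions = current
    cache.subspaces.clear()              # this partition or an ancestor changed
    return False
```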

Since we also have to check the top partition, anytime a new tenant is created (a new partition created under the root), it would still wipe all the caches, because we don't have enough granularity to decide what changed at the root. Maybe we could decide to not bump the version when creating new folders? I'm not sure if there are any ABA scenarios where that would cause problems.

Adding and maintaining such a key per partition is a breaking change, requiring the DL version to be bumped to at least "1.1", and would need to be coordinated across all bindings... Is this something that is reasonable to do? What are the implications for upgrades/downgrades?

I was thinking about this too in my less-performant suggestion that the cache is invalidated within a transaction if it's been modified. It made me wonder if there was some value in having a portion of key-space that a transaction could write into but that wouldn't be committed, such that users of a single transaction could coordinate on some state by writing and reading it. Concurrency guarantees would be pretty limited, but maybe it would be useful.

Since I need this already, I implemented it at the binding layer. I added a simple synchronized Dictionary<Type, object> to each transaction instance, which can be used by any layer or app code via simple tr.Context.GetLocalData<SomeLayerState>() and tr.Context.SetLocalData<SomeLayerState>(...) APIs. This makes it easy to plug in simple flags, values, lists of keys, hashmaps, or even complex object graphs, without needing to serialize them into binary. I'm debating whether this state needs to persist across retries/resets or not. Maybe let the layer decide with a flag.
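For bindings without such a built-in context, one possible analogue (sketched here in Python, with hypothetical helper names) is a weak map keyed by the transaction object; note that because the retry decorator reuses the transaction object, this particular variant persists across retries:

```python
import weakref

# A weak map from transaction object to a per-transaction dict of state.
_local_state = weakref.WeakKeyDictionary()

def get_local_data(tr, key, default=None):
    return _local_state.setdefault(tr, {}).get(key, default)

def set_local_data(tr, key, value):
    _local_state.setdefault(tr, {})[key] = value

# Example: a layer records that it mutated the directory tree in this transaction.
#   set_local_data(tr, 'dl_mutated', True)
#   if get_local_data(tr, 'dl_mutated'): ...
```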

I'm pretty sure that each binding could find the best way to implement such a "bag of state" using whatever mechanism works best for the language, in a way that is backward-compatible with previous versions of fdb.

But that doesn't mean a special AddNoCommitRange(...) API couldn't help in some scenarios. Maybe it could help with the elusive case of a transaction that starts with a clearRange('', '\xFF') and never commits, to simplify using FDB with concurrent unit tests?

If two layers are operating independent caches, they could probably be made to work by doing what you described for the DL case -- checking the starting metadata version and some extra per-layer state to indicate whether the layer has changed within a transaction. In this case it wouldn't matter if the one layer changed something in its purview because it shouldn't affect the other.

Our typical use case is that a higher-level layer would need to have its own state in cache, but would be forbidden to cache directory subspace instances. Instead, it would have to re-get the directory subspaces in every transaction, relying on the DL caching them properly to be efficient.

This probably means that, more generally, any Layer at "level" N can cache its own state but MUST NOT cache state obtained from Layer N-1, instead going through the cache of that level every time. Likewise, it would have to provide efficient "GetFooCached(...)" methods so that layers at level N+1 can implement their own cache on top of Layer N.

Anytime a developer breaks this rule, the caching is potentially broken.

To help track these kind of bugs, I'm thinking of adding a remote self-destruct feature to DirectorySubspace instances, so that if they were created from a cache context, and that context is invalidated, then any later attempt to use the instance to encode/decode keys would throw a retryable error.

Implementors of intermediate layers would also need to implement such a safety system in their cacheable resources, to keep higher-level layer / app developers from shooting themselves in the foot.
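A minimal sketch of such a self-destructing wrapper, where `context.is_valid()` is an assumed interface on whatever cache context produced the subspace, and not_committed (1020) is used as the retryable error:

```python
import fdb

fdb.api_version(610)

class CachedDirectorySubspace(object):
    """Wraps a cached DirectorySubspace and refuses to encode/decode keys once
    the cache context it came from has been invalidated."""

    def __init__(self, subspace, context):
        self._subspace = subspace
        self._context = context          # e.g. the cache generation it belongs to

    def _check(self):
        if not self._context.is_valid():     # assumed interface on the context
            # Packing keys with a possibly stale prefix is unsafe: force the
            # caller's retry loop to run again and re-open the directory.
            raise fdb.FDBError(1020)          # not_committed is retryable

    def pack(self, t=()):
        self._check()
        return self._subspace.pack(t)

    def unpack(self, key):
        self._check()
        return self._subspace.unpack(key)
```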

though there still exist places where these keys aren't compressed that might need to be addressed too, such as when sending keys and values over the wire in response to a range read.

This I think is important, and should probably be addressed when redwood goes mainstream. Maybe having some sort of very fast compression scheme at the transport level (lz4? lzo? zstd-1?) could solve the network overhead pretty simply.

At the binding level, maybe having a way to pass a key composed of multiple segments would help reduce memory overhead at the app level. Maybe a similar scheme could also be used when decoding the result of fdb_future_get_keyvalue_array?

KrzysFR avatar Dec 10 '19 17:12 KrzysFR