Prevent duplicate entries in uri_keys lists
This prevents duplicate entries in the uri_keys lists for the items in the keyring.
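Conceptually, the change amounts to guarding the append. Here's a minimal sketch, assuming the keyring is a dict mapping each URI to a list of cache keys; the real structure and helper names in wagtail-cache may differ:

```python
def add_to_keyring(keyring: dict, uri: str, cache_key: str) -> None:
    # Assumed shape: {uri: [cache_key, ...]} -- a simplification of the
    # actual keyring structure.
    uri_keys = keyring.setdefault(uri, [])
    # Only append if the key isn't already registered for this URI, so
    # repeated cache misses for the same page don't grow the list.
    if cache_key not in uri_keys:
        uri_keys.append(cache_key)
```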
Interesting... just confirming what is happening here. So when a cached page expires, it is still present in the keyring (until the keyring expires). Logically, in that situation it would return an error, because the key is in the keyring but the cached entry no longer exists.
Is that what you are experiencing?
This is what we're seeing:
- visit the homepage
- the homepage response is cached
- the keyring is updated to include the homepage's cache key
- (some time later) visit another page
- the other page's response is cached
- the keyring is updated to include the other page's cache key; the homepage's key is left there untouched
- because the keyring is saved again, it gets a fresh timeout
- (some time later, after the homepage response has expired from the cache but the keyring has not) visit the homepage again
- it's a cache miss, so the homepage is regenerated and the response cached
- the keyring (which has not yet expired) still contains the homepage's original cache key, but the code appends the same key to the list again
Thanks for the explanation. I think you may have uncovered a bigger problem... the keyring is probably wildly out of sync with what has actually expired in the cache. Since we never remove anything from the keyring, it probably diverges pretty quickly from what is actually in the cache.
For example, if the keyring expires, it is created anew, yet there are probably many still-valid cache entries that outlive it, so the new keyring knows nothing about them. And vice versa.
Luckily, I don't think we use the keyring during the request/response cycle; it is mainly used for the UI or for clearing individual URLs from the cache manually.
Any ideas?
I was wondering if we should clean up the keyring each time we add something to it, but that felt like it might be too expensive during the request/response cycle. How about a cleanup management command that can be run as needed? In our case the keyring is large, so we're sending a lot of data back and forth to Redis, which is what prompted me to look at this.
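Something like this, as a sketch of the management command; the cache alias and the "keyring" cache key used here are assumptions, not wagtail-cache's actual identifiers:

```python
from django.core.cache import caches
from django.core.management.base import BaseCommand


class Command(BaseCommand):
    help = "Drop keyring entries whose cached responses have expired."

    def handle(self, *args, **options):
        cache = caches["default"]  # assumed cache alias
        keyring = cache.get("keyring", {})  # assumed keyring cache key
        cleaned = {}
        for uri, uri_keys in keyring.items():
            # Keep only keys whose cached responses still exist (a missing
            # key is treated the same as an expired one).
            live_keys = [key for key in uri_keys if cache.get(key) is not None]
            if live_keys:
                cleaned[uri] = live_keys
        cache.set("keyring", cleaned)
        self.stdout.write("Kept %d URIs in the keyring." % len(cleaned))
```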
Alternatively, rather than having one single keyring with everything in it, how about having a separate item for each cached URL? Each would then have its own expiry time, which would be close to the matching cached response's expiry time (at least for the last matching response that was cached).
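Roughly what I have in mind, as a sketch (the key prefix and the timeout plumbing are made up for illustration):

```python
from django.core.cache import cache

URI_KEYS_PREFIX = "uri_keys:"  # assumed prefix, not an existing setting


def register_cache_key(uri: str, cache_key: str, timeout: int) -> None:
    """Track cache keys per URI, with each tracking entry expiring on its own."""
    entry_key = URI_KEYS_PREFIX + uri
    uri_keys = cache.get(entry_key, [])
    if cache_key not in uri_keys:
        uri_keys.append(cache_key)
    # The per-URI entry gets its own timeout, roughly matching the most
    # recently cached response for this URI.
    cache.set(entry_key, uri_keys, timeout)
```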
That's a good point about the keyring adding performance overhead (shuffling it back and forth to update it each time).
Rather than maintaining a keyring at all, it would be better if we could "query" the contents of the cache in situations where we want to see a list of URLs, etc. The query would be more expensive, but it is only used when viewing the contents of the cache in the UI, or when purging URLs based on regular expressions.
I'm not sure how to do this though. Might need to research how other caching systems handle this.
I think it might be better to use the database to store the combination of URL, cache key, and timeout. That would make it easy to get the cache keys for a given URL (or set of URLs) or URL prefix. Then we don't need to worry about which cache backends might support some sort of search or wildcard lookup.
Expired items could be purged as needed with a management command, but it doesn’t matter too much when that happens because the queries can exclude them.
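A rough sketch of a model along those lines (the model and field names are placeholders, not necessarily what we'd end up with):

```python
from django.db import models
from django.utils import timezone


class CacheEntryQuerySet(models.QuerySet):
    def unexpired(self):
        # Queries simply exclude rows whose timeout has passed, so it doesn't
        # matter much when the purge actually runs.
        return self.filter(expires_at__gt=timezone.now())


class CacheEntry(models.Model):
    url = models.URLField(db_index=True)
    cache_key = models.CharField(max_length=255, unique=True)
    expires_at = models.DateTimeField(db_index=True)

    objects = CacheEntryQuerySet.as_manager()
```

Getting the keys for a URL prefix would then just be a filter, e.g. CacheEntry.objects.unexpired().filter(url__startswith="/blog/").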
I think the database is actually a good idea. It would only need to be written when updating the cache, so the performance hit is reasonable. However, I would want to make sure it is automatically purged (maybe on write); otherwise, I can guarantee some sites will end up with gigabytes of forgotten cache entries in the database.
If we could achieve a similar behavior using a cache entry/keyring instead, that might be preferable, as it would be purged automatically. Kind of like the "single-entry" solution you mentioned above. Although that then raises the question of how the UI would show the list of all cached entries, unless we can query the cache using a key prefix.
I was worried about the speed of automatically purging during the request/response cycle if there are lots of things to purge. But I guess if it's done each time an item is put into the cache, it will happen frequently enough to never get too slow. I've certainly seen Django's session table get large when its management command hasn't been run.
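As a sketch of purge-on-write, reusing the hypothetical CacheEntry model from the earlier sketch:

```python
from datetime import timedelta

from django.utils import timezone

from myapp.models import CacheEntry  # hypothetical app/model from the earlier sketch


def record_cached_response(url: str, cache_key: str, timeout: int) -> None:
    now = timezone.now()
    # Record (or refresh) the row for this cached response.
    CacheEntry.objects.update_or_create(
        cache_key=cache_key,
        defaults={"url": url, "expires_at": now + timedelta(seconds=timeout)},
    )
    # Opportunistic purge: because this runs on every cache write, the backlog
    # of expired rows should stay small and the delete should be cheap.
    CacheEntry.objects.filter(expires_at__lte=now).delete()
```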
On the cache query approach, I think it may be possible to do a key-prefix query in Redis with something like KEYS, although that comes with a fairly stern warning about using it in production. Django's built-in cache backends would need to be extended, though.
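For what it's worth, SCAN avoids the blocking behavior KEYS is warned about. A sketch against redis-py directly, assuming the cached responses share a known key prefix (which we'd have to guarantee somehow):

```python
import redis

r = redis.Redis()


def cache_keys_with_prefix(prefix: str):
    # SCAN cursors through the keyspace in batches rather than blocking the
    # server the way KEYS does; the match pattern is glob-style.
    yield from r.scan_iter(match=prefix + "*", count=500)
```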
I've looked at the file-based one, and that would be difficult because its _key_to_file method uses an MD5 hash, so we couldn't derive the keys from the list of files on disk; we'd still need to keep a registry of those keys somewhere.
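To illustrate, here's a simplified stand-in for what the file-based backend does (not its exact code): the on-disk name is a one-way hash of the key, so listing the files tells us nothing about the original keys or URLs.

```python
import hashlib
import os


def key_to_filename(cache_dir: str, key: str) -> str:
    # Roughly what FileBasedCache._key_to_file does: the filename is an MD5
    # digest of the key plus a suffix, which can't be reversed into the key.
    return os.path.join(
        cache_dir, hashlib.md5(key.encode()).hexdigest() + ".djcache"
    )
```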
So I think the database way would be simpler to implement.
I've had a go at that in https://github.com/ixc/wagtail-cache/tree/keyring_in_database and we're testing now to see how it goes.