Add filesystem cache implementation
Commit Message: Add filesystem cache implementation
Additional Description: This is a work in progress; it functions as a cache in this state, but lacks LRU removal and repopulating the cache on restart.
Risk Level: None - new extension, so it doesn't do anything if not activated.
Testing: Passes the cache test. Additional coverage pending.
Docs Changes: Adds docs for configuring the new cache implementation.
Release Notes: n/a, still WIP.
Platform Specific Features: Uses the AsyncFile library, therefore no Windows support yet.
As a reminder, PRs marked as draft will not be automatically assigned reviewers, or be handled by maintainer-oncall triage.
Please mark your PR as ready when you want it to be reviewed!
CC @envoyproxy/api-shepherds: Your approval is needed for changes made to (api/envoy/|docs/root/api-docs/).
envoyproxy/api-shepherds assignee is @htuch
CC @envoyproxy/api-watchers: FYI only for changes made to (api/envoy/|docs/root/api-docs/).
Sorry about the force-push - I have this as a two-branch chain in my workspace (add the API, add the impl), so moving the API in the first branch was an insertion of a commit from the perspective of this branch.
CC @envoyproxy/dependency-shepherds: Your approval is needed for changes made to (bazel/.*repos.*\.bzl)|(bazel/dependency_imports\.bzl)|(api/bazel/.*\.bzl)|(.*/requirements\.txt)|(.*\.patch).
envoyproxy/dependency-shepherds assignee is @htuch
/wait
However I would like to review a design doc first and then look at code.
I don't think there's really all that much to design at this stage; it's essentially a minimum viable product of a filesystem cache at this point. But I do have a half-assed design doc lying around anyway! https://docs.google.com/document/d/1DWy1TE8RfV_0_fWs28DARS-GblUNuSRBqFxagOi6QPQ/edit?usp=sharing
In particular, your choice to punt on eviction for now makes me wonder: who would use this instead of the simple cache? I assume you can't use a non-evicting cache in production either.
It's not "punt on eviction and start using it", it's "punt on eviction so that we can review the part of it that's implemented and make the eviction part a separate review because this is already a pretty huge thing for review." The eviction implementation plan (already in progress) is to start with a very simple (maybe too simple) LRU eviction that just does a linear scan over the cache hash table to find the least recently used entry when an eviction is necessary. If it turns out that provokes a bottleneck there are several other options I have in mind, but there's a decent chance that something more complicated would actually be slower than the very simple option in practice, and it certainly would be harder to test and fully understand.
who would use this instead of the simple cache?
Someone with an instance of a proxy that's more memory constrained than disk-space constrained.
I'm not wedded to any of the design decisions we made in 2010 when building the pagespeed file cache
Good news, it seems like there's a significant amount in common - e.g. cache metadata in memory (and partially duplicated to the disk cache).
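For concreteness, a hypothetical sketch (field names invented here, not taken from the PR) of the kind of per-entry metadata that could live in memory, with a subset duplicated into each cache file's header on disk; duplicating it on disk is also what would make repopulating the index on restart possible:

```cpp
#include <cstdint>
#include <string>

struct CacheEntryMetadata {
  std::string key_hash;            // identifies the on-disk cache file
  uint64_t body_size = 0;          // needed for size accounting and eviction
  uint64_t last_used_epoch_s = 0;  // recency information for LRU decisions
  // Fields like the above could be mirrored into the file header on disk,
  // while purely transient state (in-flight readers, locks, etc.) would
  // stay in memory only.
};
```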
Thanks for the doc; I feel that a lot of the decisions made in the non-evicting version need to be highly informed about how you will do eviction. I'll try to read in the next day or so.
There are additional structures that would be necessary to accommodate different eviction methods, but I don't think any of them would really change this core, only build on top of it. It's gonna want those O(1) lookups regardless of any secondary indexing method.
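To illustrate what "build on top of it" could look like (my assumption, not the PR's stated plan): a classic O(1) LRU keeps the same hash-table lookups and adds a recency list as the secondary index. `std::list` iterators stay valid across `splice()`, so both touching and evicting become O(1).

```cpp
#include <list>
#include <string>
#include <unordered_map>

class LruIndex {
public:
  // Mark a key as most recently used, inserting it if absent.
  void touch(const std::string& key) {
    auto it = map_.find(key);
    if (it == map_.end()) {
      order_.push_front(key);
      map_[key] = order_.begin();
    } else {
      order_.splice(order_.begin(), order_, it->second);  // move to front, O(1)
    }
  }

  // Drop the least recently used key; returns false if empty.
  bool evictOne() {
    if (order_.empty()) return false;
    map_.erase(order_.back());
    order_.pop_back();
    return true;
  }

private:
  std::list<std::string> order_;  // front = most recently used
  std::unordered_map<std::string, std::list<std::string>::iterator> map_;
};
```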
Thanks, I read through the doc and left comments; I'd like to iterate on that a bit before code. There really are a lot of options in front of you even before you actually implement eviction :)
/wait
I'm still hoping to converge fully on the doc before diving into the code :)
/wait
what's up with the "CODEOWNERS" file?
Hopefully it's just that I got my name wrong (in most places someone else hasn't taken my name first!) Updated.
Is there a way to break this up into smaller PRs, maybe bottom up cover each class and its tests?
Not really, not in a way that makes any sense. Unit-testing helper classes like the subclasses here would only create a bunch of churn with no value. For example, when I rewrote this from using a memory cache for lookups to not using one, that would have required rewriting a whole mess of tests. The first two PRs would also have read as "what is this even for?", to the point that the rest of the chain would still have to be there to make sense of them; they would then have been discarded only after the whole chain was written, a whole bunch of new zero-value tests would have had to be written for the new helper abstractions, and on top of that there's all the git-wrangling of trying to manage a chain of PRs where the first ones got replaced.

Any such tests would essentially just be "did the code do what the code did" instead of "did the code do what it is supposed to do", like the test for the example cache that ended up incorrectly asserting that a cache must do the incomplete thing. The tests at the filter level are the lowest-level tests that have any value.
I would have preferred to break it into smaller pieces, but the first piece (just the API) can't pass CI, because there's validation that any API must have an implementation, and then you can't work bottom-up from there because the implementation of the API is the top piece. I am not a fan of this model precisely because it essentially demands either monolithic PRs, or small PRs that are initially dead code and make no sense on their own... but we must work with the tools we have.
what's up with the "CODEOWNERS" file?
Hopefully it's just that I got my name wrong (in most places someone else hasn't taken my name first!) Updated.
Nope, it's just like that. It looks like most entries in CODEOWNERS have the same problem, there's a lot of red squiggles in that view. Figuring this out is above my paygrade, as someone who apparently doesn't exist.
I thought I'd add Todd to the review for caching semantics. Todd -- I shared the doc with you also. You might want to look at that first.
a bottom-up impl broken into more PRs
In order to do that I'd have to write bundles of essentially pointless test cases, because otherwise the small PRs wouldn't be able to pass CI due to coverage constraints (constraints that are satisfied by the meaningful test cases further up the stack).
If you really think such bundles of pointless test cases would have value then I can grudgingly do it that way. It also makes things awkward because git is not good at modeling "this depends on two branches that don't depend on each other", which would mean either that, e.g., cache_file_fixed_block has to depend on #23717 even though it has no connection to it, or a lot of painful manual wrangling of inverted branch dependencies where the dependency actually occurs.
`time(CodeReview) = K * 2^(lines in PR)` seems like a very strange approximation - adding one line doubles the review time? And there would also have to be something like 500 additional lines of pointless tests to make the smaller PRs pass CI. Surely just reviewing one file at a time wouldn't be that different from doing the same thing in separate PRs?
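Spelling the objection out with illustrative numbers (mine, not from the thread): taken literally, the formula doubles with every added line, so a 20-line PR would take roughly a thousand times as long to review as a 10-line one.

```math
t(N) = K \cdot 2^{N} \quad\Rightarrow\quad \frac{t(N+1)}{t(N)} = 2, \qquad \frac{t(20)}{t(10)} = 2^{10} \approx 1000
```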
RE pointless tests: I think in general we want unit tests for classes in addition to the mid-level tests that show them in context. It's much easier to hit the corner-cases of a lower-level API directly.
RE 2^N formula: it's just an approximation; probably it doesn't double from 1 line to 2 lines. But the exponential observation is just that I rarely get a solid block of time where I can focus on reviewing a huge PR. What happens is I get 10 minutes here, 15 minutes there, in between other things. So if I can get through the whole thing in 20 minutes, I can do it reasonably efficiently. But if I have to come back and pick up where I left off, then typically I have to re-review earlier parts to get context.
However if I have a top-down spec and review the bottom layer first, I can understand the context in which it will be used, but then just rip through the dead simple implementation and its simple tests :) Then the layers on top of that will also be simple.
RE git limitations: yes, I agree. Actually I'm not good at advanced source control, and I wind up doing a lot of patching and diffing to cobble together pieces from various things. Usually what I like to do is start with a complete implementation, which is impossible to review. Then I peel off pieces bottom-up and get them reviewed in a different git client (or clients). Once those merge, I do the git pull and merge it in, and the weight of that PR decreases over time, at the cost of me having to resolve conflicts if the reviewers required changes. But the benefit is that the pieces are each reviewable (hopefully).
/wait
Marking as such, since when the constituent smaller PRs land, this will either be obsoleted by them or become simple enough to review easily.
/retest
Retrying Azure Pipelines: Check envoy-presubmit isn't fully completed, but will still attempt retrying. Retried failed jobs in: envoy-presubmit
Looks like there are still a few CI issues
/wait