sphinxdocs: implement content-based change detection plugin
Sphinx has change detection to facilitate incremental rebuilding, but it's timestamp-based. Bazel doesn't reliably preserve timestamps, nor are timestamps highly reliable, so this functionality isn't usable. This means Sphinx has to rebuild everything, every time, which can get quite slow. Pigweed, for example, takes many minutes. Even in rules_python, it takes just under a minute (long enough that I think, "it's just building docs, why is this taking so long?").
To fix this, I think we can implement a plugin that uses the env-get-outdated event; see this comment: https://github.com/sphinx-doc/sphinx/issues/11556#issuecomment-1667507177
API docs: https://www.sphinx-doc.org/en/master/extdev/event_callbacks.html#event-env-get-outdated
All it has to do is calculate a hash of the file and compare it to a previous hash.
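A minimal sketch of that hash-and-compare step, independent of Sphinx. The function names `file_digest` and `changed_files`, and the choice of SHA-256, are assumptions for illustration; where the previous digests are persisted is left open here.

```python
import hashlib


def file_digest(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file's bytes."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def changed_files(paths, previous):
    """Compare current digests against a previous {path: digest} map.

    Returns (changed, current): the set of paths whose digest is new or
    different, and the freshly computed digest map to store for next time.
    """
    current = {str(p): file_digest(p) for p in paths}
    changed = {p for p, d in current.items() if previous.get(p) != d}
    return changed, current
```

A file with no previous digest is treated as changed, which matches what a first build would need.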
Looking through those API docs, I wonder if some of those other events would be of interest, especially for a persistent worker.
Some notes as I was working on #2938
Using mtime is hard-coded into Sphinx, and there aren't any hooks to customize it. The only avenue I see would be to monkeypatch sphinx.util.osutil._last_modified_time to return fake mtimes.
The core algorithm for its change detection is in sphinx.environment.BuildEnvironment.get_outdated_files. This is basically a loop over docs to get their mtime and check the mtime of dependency docs. It serves as a nice reference for what a hash-based impl should do.
The get_outdated_files method is called by sphinx.builders.Builder.read. This method first computes the mtime-based changed docs, then emits the env-get-outdated event so plugins can add additional docs.
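For reference, a rough hash-based analogue of that loop might look like the sketch below. This is not Sphinx's code; `hash_outdated` and all of its parameters are hypothetical, and plain dicts and sets stand in for the env state. It only mirrors the shape described above: classify docs as added, changed, or removed, where a doc counts as changed if its own file or any of its dependencies has a different digest.

```python
def hash_outdated(found_docs, known_docs, doc_path, dependencies,
                  old_digests, digest):
    """Classify docs by comparing content digests instead of mtimes.

    found_docs:   docnames present in this build
    known_docs:   docnames read by the previous build
    doc_path:     {docname: source path}
    dependencies: {docname: set of paths the doc depends on}
    old_digests:  {path: digest} recorded by the previous build
    digest:       callable mapping a path to its current digest
    """
    added = found_docs - known_docs
    removed = known_docs - found_docs
    changed = set()
    for doc in found_docs & known_docs:
        # A doc is outdated if its own source or any dependency differs.
        paths = [doc_path[doc], *sorted(dependencies.get(doc, ()))]
        if any(old_digests.get(p) != digest(p) for p in paths):
            changed.add(doc)
    return added, changed, removed
```

Because the event only lets a plugin add docs on top of the mtime-based result, an extension would presumably return just the `changed` set from something like this.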
Pigweed has a hash-based prototype based on #2938 and https://pigweed-review.googlesource.com/c/pigweed/pigweed/+/294057/6/docs/_extensions/env_get_outdated.py
The basic way it works is that the worker code computes what's changed and writes it to $docTrees/digest.json before invoking Sphinx. The env_get_outdated.py extension then reads that file and returns the extra paths.
After digging around in the sphinx code, some possible implementations come to mind:
- A Bazel-independent extension. This means the extension does its own hashing and stores the hashes somewhere. Where depends, but it looks like the env is intended to store this sort of arbitrary, extension-specific info, and the env-purge-doc event can be used to handle when a doc changes.
- In the worker, synthesize an extension and add it to sphinx.application.builtin_extensions. Wire together this and the hash state the worker computes from the Bazel request info. The --define flag could be used to have the worker set a path to e.g. a file with worker info. Or just stick it in a global, and the extension relies on the global being properly set.
In any case, the env and app objects passed to the extension setup and event handler are full of methods that seem relevant, e.g. BuildEnvironment.doc2path