
Cache dependencies between crates published by the same owner

Open jyn514 opened this issue 3 years ago • 21 comments

Right now, when a large workspace publishes its crates, we spend a lot of time rebuilding the same dependencies over and over again (e.g. https://github.com/aptos-labs/aptos-core). To reduce the load on docs.rs and speed up builds, we can cache those dependencies to avoid rebuilding them.

We are only planning to do this between crates owned by the same crates.io owner for security reasons; see https://discord.com/channels/442252698964721669/541978667522195476/996125244026794134 for a discussion of how this causes issues between crates. In practice, this should still be a giant speed up since most of our build times come from large workspaces.

In order to have deterministic time limits, we are going to count the time spent to compile a dependency against the time limit, even if we already have the dependency cached locally. To avoid running out of disk space we'll use an LRU cache that deletes crates when we have less than ~20 GB free on the production server. For reproducibility we are going to make the cache read-only, and wipe the target/doc directory after each root crate we build.
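The LRU eviction policy described above could be sketched as pure bookkeeping. All names here (`CachedDep`, `pick_evictions`, `MIN_FREE_BYTES`) are hypothetical, invented for illustration; a real implementation would measure actual disk usage and delete directories on disk:

```rust
/// One cached dependency build, tracked for LRU eviction.
/// (Hypothetical structure; the real cache would live on disk.)
struct CachedDep {
    name: String,
    last_used: u64, // monotonically increasing "time" of last use
    size_bytes: u64,
}

/// Free-space threshold from the plan above: evict until ~20 GB are free.
const MIN_FREE_BYTES: u64 = 20 * 1024 * 1024 * 1024;

/// Return the names of cached deps to delete, least recently used first,
/// until the projected free space reaches the threshold.
fn pick_evictions(mut cache: Vec<CachedDep>, free_bytes: u64) -> Vec<String> {
    let mut free = free_bytes;
    let mut evicted = Vec::new();
    cache.sort_by_key(|d| d.last_used); // oldest first
    for dep in cache {
        if free >= MIN_FREE_BYTES {
            break;
        }
        free += dep.size_bytes;
        evicted.push(dep.name);
    }
    evicted
}
```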

Here is the rough approach we're imagining:

  1. Run cargo check --timings=json -Zunstable-options -p {dep} for each dependency of the crate we're building
  2. If we can parse the JSON output, record the timings in a database. Otherwise, make a note to wipe the cache for everything we were unable to parse after this build finishes.
  3. Run chmod -w on everything in target (except for the top-level directory, so cargo can create target/doc)
  4. Subtract the timings of all dependencies from the time limit (even if some were already cached). If we were unable to parse the JSON, subtract the entire time needed in step 2.
  5. Run cargo doc and upload to S3.
  6. Delete target/doc.
  7. Repeat 1-6 for all new crates.io crates published by the same owner.
  8. Wipe the cache and the database table the next time we call rustwide.update_toolchain, since the new nightly will be unable to reuse the cache.
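Step 4's accounting could look roughly like this. The `DepTiming` and `remaining_limit` names are invented for this sketch and are not docs.rs code:

```rust
use std::time::Duration;

/// Outcome of the `cargo check --timings=json` step for one dependency:
/// either parsed timings, or just the wall-clock time of the whole check
/// when the JSON couldn't be parsed. (Hypothetical type for this sketch.)
enum DepTiming {
    Parsed(Duration),
    Unparsed { wall_clock: Duration },
}

/// Step 4 above: charge every dependency's compile time against the limit,
/// even when the dependency was already cached. Returns the time left for
/// the `cargo doc` invocation, or `None` if the limit is already exhausted.
fn remaining_limit(limit: Duration, deps: &[DepTiming]) -> Option<Duration> {
    let mut remaining = limit;
    for dep in deps {
        let cost = match dep {
            DepTiming::Parsed(t) => *t,
            // Fallback: subtract the entire time step 2 needed.
            DepTiming::Unparsed { wall_clock } => *wall_clock,
        };
        remaining = remaining.checked_sub(cost)?;
    }
    Some(remaining)
}
```

Charging cached dependencies as if they were rebuilt is what keeps the limit deterministic: a crate either always fits in its budget or never does, regardless of cache state.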

jyn514 avatar Jul 11 '22 19:07 jyn514

Run chmod -w on everything in target

This sounds like cargo check for each dependency ends up collecting everything we need for rustdoc. It also assumes we can make rustdoc work with the /target/ dependencies read-only, except for the output directory.

Shouldn't this prevent any possible malicious changes that the crate author can do? So we actually could cache the dependency for all owners?

Also the --timings=json logic and calculation feels like quite some complexity "only" for deterministic time-limits.

syphar avatar Jul 16 '22 16:07 syphar

The most aggressive attack that @pietroalbini mentioned was:

  • We build some crate foo under control of the attacker
    • This has a build script that generates a bar_macro.rlib and associated files into the target-dir to make it look like bar-macro has already been built.
  • We build docs for another crate bar which depends on a proc-macro bar-macro
    • Cargo sees that (the attacker-generated) bar_macro.rlib is up-to-date and uses it; now the attacker has code running in the rustc process during the build for another crate's docs and can essentially do anything it wants to those docs.

Nemo157 avatar Jul 16 '22 17:07 Nemo157

Wouldn't that be prevented by only having /target/doc writable?

Or does cargo check not do everything cargo doc needs and we will add files?

syphar avatar Jul 16 '22 17:07 syphar

Oh, I forgot one indirection in the first step, the crate foo depends on another crate foo2 under control of the attacker. The foo2 build script will be run during the cargo check step while the whole target-dir is writeable.

Nemo157 avatar Jul 16 '22 17:07 Nemo157

( sorry if I'm missing something, these build specifics are new to me )

Oh, I forgot one indirection in the first step, the crate foo depends on another crate foo2 under control of the attacker. The foo2 build script will be run during the cargo check step while the whole target-dir is writeable.

How would that be different with or without caching?

syphar avatar Jul 16 '22 18:07 syphar

Without caching: if foo2 writes libbar_macro.rlib into the target directory, that will be wiped before we build the docs for bar, so there's no way for one build-tree to affect another build-tree (you only have to trust your dependencies, not everything else that has been built by the builder).

Nemo157 avatar Jul 16 '22 20:07 Nemo157

Ah, then I get it. Thank you for spending the time explaining!

Couldn't we keep the cache by top-level dependency and version? Of course this would mean duplicate storage of the output binaries, but the implementation could be quite simple.

syphar avatar Jul 17 '22 07:07 syphar

That means the possible process could be (without timings):

for each top-level dependency:

  1. check if we have a cached target directory. If yes, copy it to the future doc-build target, skipping any duplicate files.
  2. if we don't have a cached target, run cargo check dep@vers in an isolated env. Put the output into the cache, & copy it over to the doc-build target

Once we've done this for all top-level dependencies, we can run cargo doc without any permission change, since we don't put any output files into the cache. Also, if we missed something from the cache, cargo doc will just build it.
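The copy-and-skip-duplicates idea could be sketched by modelling the target directories as in-memory maps from relative path to contents. `merge_cached_target` is a hypothetical helper, not docs.rs code; the real thing would copy files on disk:

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Merge one cached dependency's target files into the doc-build target,
/// skipping any relative path that is already present (step 1 above).
/// Returns the set of paths actually copied.
fn merge_cached_target(
    build_target: &mut BTreeMap<String, String>,
    cached: &BTreeMap<String, String>,
) -> BTreeSet<String> {
    let mut copied = BTreeSet::new();
    for (path, contents) in cached {
        if !build_target.contains_key(path) {
            build_target.insert(path.clone(), contents.clone());
            copied.insert(path.clone());
        }
    }
    copied
}
```

Skipping duplicates means the first cache entry to provide a path wins, so files already staged for the build are never overwritten by a later dependency's cache.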

syphar avatar Jul 17 '22 07:07 syphar

I think the most effective way would be to just keep the same target directory only when the previous crate was published by the same person as the new crate. So, for example:

  • foo by @alice, uses a fresh target directory as no build happened before
  • bar by @alice, reuses the target directory of foo as it's also published by @alice
  • baz by @bob, removes old target dir and uses a fresh one as it's not published by @alice
  • foobar by @alice, removes old target dir and uses a fresh one as it's not published by @bob

That would solve the problem of big workspaces being published, as I assume they would all be published by the same user, and removes the chance of someone sneaking something into someone else's docs.
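The reuse rule in the example above can be sketched as follows (`can_reuse_target` and `reuse_decisions` are hypothetical helpers, not actual docs.rs code):

```rust
/// Whether the previous build's target directory can be kept for the next
/// crate: only when the previous crate was published by the same account.
fn can_reuse_target(previous_publisher: Option<&str>, next_publisher: &str) -> bool {
    previous_publisher == Some(next_publisher)
}

/// Walk a queue of (crate, publisher) pairs and report which builds
/// get to keep the previous target directory.
fn reuse_decisions<'a>(queue: &[(&'a str, &'a str)]) -> Vec<(&'a str, bool)> {
    let mut prev: Option<&str> = None;
    let mut out = Vec::new();
    for &(krate, publisher) in queue {
        out.push((krate, can_reuse_target(prev, publisher)));
        prev = Some(publisher);
    }
    out
}
```

Run over the example queue (foo and bar by @alice, baz by @bob, foobar by @alice), only bar gets to reuse the target directory; the other three start fresh.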

emilyalbini avatar Jul 17 '22 08:07 emilyalbini

Ah, so this would also depend on the queued order of the crates?

syphar avatar Jul 17 '22 08:07 syphar

So it would only help specifically for these bulk crates that are quickly released after each other, and not for any other crates.

syphar avatar Jul 17 '22 08:07 syphar

We have a hard-limit of max 1 day of caching because of updating the compiler. There are probably plenty of cases where a few related crates are published throughout the day which we would miss, but the really bad situations are when we have the mass publishes of workspaces with very deep dependency trees.

Nemo157 avatar Jul 17 '22 09:07 Nemo157

How would that look with multiple owners? Treat them as one (unique) new owner/team?

syphar avatar Jul 17 '22 12:07 syphar

We can get the actual publisher of each release from the crates.io API (even for crates with multiple owners, if they're publishing a workspace at once I'd assume a single account will be used for all the crates).

Nemo157 avatar Jul 17 '22 12:07 Nemo157

FWIW this is painful enough that I would be OK with omitting the "record the timings in a database" bit for now if it lets us get the caching in sooner. It can only cause false positives (your crate intermittently builds when otherwise it would have always failed), not false negatives (your crate intermittently fails when it should have always succeeded).

Verifying the crates are owned by the same owner is not something we can omit though, for security reasons.

jyn514 avatar Mar 09 '23 17:03 jyn514

I'm with you on both points, the first one we also already talked about, the second was clear to me.

What's slowing me down is mostly the iteration speed / debugging with Docker on my Mac, since I have to go through the web container for everything. On top of that, the build process & rustdoc are definitely the parts I know the least, so a lot of trial & error is needed.

While improving the dev experience is certainly possible, currently I'm trying to invest time into artifact caching here, after the CPU load problem was worked around & a possible fix is in a PR.

syphar avatar Mar 09 '23 18:03 syphar

@syphar yup definitely, I wasn't trying to imply you weren't doing enough work ❤️ for context @alice-i-cecile was asking how she could help speed up the build queue and I wanted to make sure the issue was up to date since she hasn't been sitting in on our meetings.

jyn514 avatar Mar 09 '23 18:03 jyn514

Yep, the Bevy community was curious about how we could help, and I was pointed here <3

alice-i-cecile avatar Mar 09 '23 18:03 alice-i-cecile

@syphar yup definitely, I wasn't trying to imply you weren't doing enough work ❤️

I wasn't trying to imply that you implied ;)

I myself feel slow when I'm working on the build-process, also visible when I want to dig into the other bigger build-process topic (adding builds into the DB before the build)

for context @alice-i-cecile was asking how she could help speed up the build queue and I wanted to make sure the issue was up to date since she hasn't been sitting in on our meetings.

ah, thanks for the explanation!

If needed, I can create a WIP PR + some explanation of where I am right now, if @alice-i-cecile wants to help out.

syphar avatar Mar 09 '23 18:03 syphar

Sure, that sounds lovely :) I don't personally have a ton of time or expertise here, but I'll take a look and share it around the community: we have a lot of talented folks and care about improving things for the whole ecosystem.

alice-i-cecile avatar Mar 09 '23 18:03 alice-i-cecile

I created the draft PR with what I already had: https://github.com/rust-lang/docs.rs/pull/2079

syphar avatar Mar 09 '23 20:03 syphar