trillian-examples icon indicating copy to clipboard operation
trillian-examples copied to clipboard

ctclone Fails with Incomplete CT Log Responses

Open tired-engineer opened this issue 2 years ago • 5 comments

Issue Description

The ctclone tool encounters failures when Certificate Transparency (CT) logs return fewer entries than requested. This issue becomes problematic during periodic ctclone operations or at the ends of the logs. At certain points, the position in the cloned log may align in such a way that the CT log returns less than the expected number of entries.

Specific Example

An instance of this issue can be observed with Sectigo's Sabre 2025h2. When a "negotiated" page size of 256 is used, the log returns only 165 entries for a specific range. This response unexpectedly brings the position to 52992 (207 * 256), causing ctclone to fail with the following error:

worker 0: Retryable error getting data: "LeafFetcher.Batch(52827, 256): wanted 256 leaves but got 165"

Additional Context

Further details about this behavior in CT logs can be found in the discussion on Let's Encrypt's forum regarding the 'coerced get-entries' feature.

Important Note

The proposed solution covers only the cases where the actual log's page size is a multiple of the requested batch size. It is assumed that the negotiated batch size requested by ctclone aligns with this multiple. If the actual page size does not conform to this pattern, additional adjustments may be required.

Edited: formatting.

tired-engineer avatar Dec 27 '23 09:12 tired-engineer

Thanks for the well written bug report and the attempted fix in #1045. This sounds like it should be an easy fix, but I think that appearance is deceptive. I wasn't aware of any logs trying to coerce alignment as specific indices, and the design of the cloner doesn't lend itself well to readjusting to these hints; the current design does a very greedy strategy in the batch.go Bulk method where it starts goroutine workers with statically computed ranges to fetch.

I think the BatchFetch mechanism will need some way to elegantly report up that it only got partial results. The cloner will then need to cancel other fetches in progress and restart fetching again, using a start index based on the last result fetched from the incomplete range. To give confidence that this works, we'd probably want to add some kind of functional test that emulates a log performing this coercion.

I'm open to receiving PRs for this. If none are forthcoming I'll try to make time for this but it's unlikely to happen in the next couple of weeks, at least.

mhutchinson avatar Jan 03 '24 14:01 mhutchinson

I agree, got actually caught up in that "deception" and still have not fixed an issue.

Another option could be to run at first just one request, check it for the alignment and spin workers accordingly received result. In the approach you offer I think it's important to also implement the cleanup of cancelled or finished fetchers, as it might break the following insertion transactions.

What do you think would be the preferred way to approach the issue?

tired-engineer avatar Jan 06 '24 13:01 tired-engineer

I like your proposal to have special-casing only on startup. Once a full chunk has been fetched, then it can enter full parallel mode as it currently does, and die on incomplete chunks.

The reason I like this is:

  • it's simpler than trying to realign mid-run
  • if the whole process is run by some orchestrator (e.g. docker) then this can be configured to restart the process on death. The events of dying/restarting will fix any transient alignment problems when the startup special casing reruns

I think this approach is worth investigating. It seems strictly better than the current behaviour. Is this something you'd be able to contribute, @tired-engineer ?

mhutchinson avatar Jan 08 '24 17:01 mhutchinson

Hey, @mhutchinson,

Thanks for getting back to that issue and for the feedback. I will give it a try this week, hopefully will be able to contribute.

tired-engineer avatar Jan 08 '24 21:01 tired-engineer

Sent the first implementation for review, a bit later than expected. I think it covers more or less everything we've discussed on this thread.

tired-engineer avatar Jan 16 '24 12:01 tired-engineer