Improve convert retry handling
Our current retry logic for converting documents (shelling out to LibreOffice) is based on two constants: the number of retry attempts and the timeout:
https://github.com/alephdata/ingest-file/blob/fca65fbb08ff37d65df3c14804ad5b1b6809b97d/ingestors/support/convert.py#L16-L17
What would be more desirable is to fail faster on the first attempt, with a timeout that increases on subsequent attempts up to a maximum.
For instance: right now we retry up to 5 times and time out after 3600s (1 hour) on each attempt. We could potentially get much better throughput by having a first timeout of 600s (10 minutes) which gets progressively larger on each retry (with a maximum cap). To illustrate:
```
TIMEOUT_START=600
TIMEOUT_INCREASE=900
TIMEOUT_MAX=3600
CONVERT_RETRIES=5
```
This would result in up to 5 retries with timeouts of 10, 25, 40, 55 and 60 minutes. Ideally "stuck" convert tasks would time out much sooner and get queued up for a retry faster.
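A minimal sketch of how these settings could translate into a per-attempt timeout (the `convert_timeout` helper is hypothetical, not existing ingest-file code):

```python
# Sketch only: convert_timeout is a hypothetical helper mapping the proposed
# settings to a per-attempt timeout; it is not part of the current codebase.
TIMEOUT_START = 600      # first attempt: 10 minutes
TIMEOUT_INCREASE = 900   # each further attempt adds 15 minutes
TIMEOUT_MAX = 3600       # hard cap of 1 hour
CONVERT_RETRIES = 5


def convert_timeout(attempt):
    """Return the timeout in seconds for a 0-indexed attempt."""
    return min(TIMEOUT_START + attempt * TIMEOUT_INCREASE, TIMEOUT_MAX)


# Yields 600, 1500, 2400, 3300, 3600 seconds, i.e. the 10, 25, 40, 55 and
# 60 minutes mentioned above.
timeouts = [convert_timeout(n) for n in range(CONVERT_RETRIES)]
```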
TODO
- [ ] try to get some data on the average (and maximum?) time it takes to convert a document
- [ ] make the conversion timeout and retry logic respect their respective settings (see the sketch after this list)
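For the second TODO item, a minimal sketch of making the hard-coded constants configurable, assuming plain environment variables (the variable names below are assumptions, not existing configuration keys):

```python
import os

# Sketch only: hypothetical environment variables so the values are no
# longer hard-coded in ingestors/support/convert.py. The names are made up
# for illustration.
CONVERT_TIMEOUT_START = int(os.environ.get("INGESTORS_CONVERT_TIMEOUT_START", "600"))
CONVERT_TIMEOUT_INCREASE = int(os.environ.get("INGESTORS_CONVERT_TIMEOUT_INCREASE", "900"))
CONVERT_TIMEOUT_MAX = int(os.environ.get("INGESTORS_CONVERT_TIMEOUT_MAX", "3600"))
CONVERT_RETRIES = int(os.environ.get("INGESTORS_CONVERT_RETRIES", "5"))
```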
Hey, I have a thought regarding the START/INCREASE/MAX variables. I like the way this works in the retry and requests libraries: they take a backoff factor as a float which controls how quickly the interval between attempts grows.
I am not deeply familiar with how this works in Aleph, but the Retry lib implements this in a nice manner.
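For illustration, a backoff-factor sketch roughly in the spirit of what those libraries expose; the formula below is an assumption made for this example, not the retry/requests implementation:

```python
def backoff_timeout(attempt, backoff_factor=600.0, timeout_max=3600):
    """Timeout for a 0-indexed attempt: grows exponentially, capped at timeout_max.

    Sketch of the backoff-factor idea only; not copied from the retry or
    requests libraries.
    """
    return min(backoff_factor * (2 ** attempt), timeout_max)


# backoff_factor=600 yields timeouts of 600, 1200, 2400, 3600 and 3600 seconds
# across 5 attempts.
timeouts = [backoff_timeout(n) for n in range(5)]
```

A single float then replaces the START/INCREASE pair, while the cap plays the same role as TIMEOUT_MAX above.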