Compactor can fail with "block with not healthy index found ... series have an average of 1.000 out-of-order chunks: 0.000 of these are exact duplicates (in terms of data and time range)" message
The compactor can fail to compact a block with a message like this:
msg="failed to compact user blocks" err="compaction: group 0@8712473450002685162: block with not healthy index found /data/compact/0@8712473450002685162/01EJEXEW6XQ37G17Q4JH9M2KF1; Compaction level 1; Labels: map[__org_id__:...]: 1/457844 series have an average of 1.000 out-of-order chunks: 0.000 of these are exact duplicates (in terms of data and time range)"
When this happens, compaction for the given user does not continue, because the compactor retries compacting this block over and over, failing each time.
Upon further investigation, this is a 2h block produced by an ingester. It's not clear why out-of-order chunks would be written; this is likely a bug in the Prometheus TSDB code.
Similar bugs in Thanos:
- https://github.com/thanos-io/thanos/issues/3442
- https://github.com/thanos-io/thanos/issues/267
A workaround is to rename the block in object storage so that it's not included in the compaction, for example:
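With blocks stored in S3, renaming can be done by copying the block under a prefix the compactor does not scan and then deleting the original objects. This is a rough sketch using the AWS CLI; the bucket name, tenant prefix and "corrupted" destination prefix are placeholders, and the block ULID is the one from the error above:
# copy the block out of the tenant's prefix, then delete the original objects
$ aws s3 cp --recursive s3://my-bucket/tenant-id/01EJEXEW6XQ37G17Q4JH9M2KF1/ s3://my-bucket/corrupted/tenant-id/01EJEXEW6XQ37G17Q4JH9M2KF1/
$ aws s3 rm --recursive s3://my-bucket/tenant-id/01EJEXEW6XQ37G17Q4JH9M2KF1/
Note that after the move the block is no longer visible to the store-gateway/querier either, so the data in it can no longer be queried.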
We recently hit this issue as well.
Could you paste the exact log error you've got?
caller=compactor.go:450 component=compactor msg="failed to compact user blocks" user=<redacted> err="compaction: group 0@16811904347059316647: block with not healthy index found /data/compactor/compact/0@16811904347059316647/01EPTTT4B3FXVQ5X7WX5XZA13K; Compaction level 1; Labels: map[org_id:<redacted>]: 1/1000000 series have an average of 1.000 out-of-order chunks: 0.000 of these are exact duplicates (in terms of data and time range)"
I am wondering, could this issue be caused by https://github.com/prometheus/prometheus/issues/8055? Because #8055 seems to introduce out-of-order samples.
It shouldn't. What you got is chunks out of order within the same block, while the issue you linked is about two different blocks overlapping in time.
@pracucci and @pstibrany, for this issue, do you think it would be an improvement if we changed the compactor not to halt the whole compaction process when a level-1 block is bad, given that there are replicas of that block?
Hi,
We also encountered this issue with the following message on our compactor:
Nov 18 16:29:56 cortex-compactor-1 cortex[8363]: level=error ts=2021-11-18T16:29:56.584539428Z caller=compactor.go:531 component=compactor msg="failed to compact user blocks" user=fake err="compaction: group 0@5679675083797525161: block with not healthy index found /var/lib/cortex/data/compact/0@5679675083797525161/01FM53C7H2SR3N111QZMA3TK8P; Compaction level 1; Labels: map[org_id:fake]: 13/20459573 series have an average of 1.000 out-of-order chunks: 1.538 of these are exact duplicates (in terms of data and time range)"
If it is any help, you can find the content of the not healthy block here : https://dl.plik.ovh/file/snWjOJ69TkrmdcUN/D8sbecutaDZBVAsN/01FM53C7H2SR3N111QZMA3TK8P.tar.gz
Regards,
Julien.
A better workaround for this issue is to mark the block as no-compact.
$ cat thanos.yml
type: S3
config:
  bucket: my-bucket
  endpoint: ...
  prefix: tenant-id
$ thanos tools bucket mark --id=$BLOCK --marker=no-compact-mark.json --objstore.config-file=thanos.yml --details=buggy
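For reference, the marker that the command uploads next to the block's meta.json (as <block-ulid>/no-compact-mark.json) looks roughly like this; the exact fields depend on the Thanos version, and the ULID and timestamp below are illustrative:
{
  "id": "01FM53C7H2SR3N111QZMA3TK8P",
  "version": 1,
  "details": "buggy",
  "no_compact_time": 1637251200,
  "reason": "manual"
}
Once the marker is in place, the compactor skips this block instead of halting the tenant's compaction, while the block itself stays in the bucket and remains queryable.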