Compactor can fail with "block with not healthy index found ... series have an average of 1.000 out-of-order chunks: 0.000 of these are exact duplicates (in terms of data and time range)" message
The compactor can fail to compact a block with a message like this:
msg="failed to compact user blocks" err="compaction: group 0@8712473450002685162: block with not healthy index found /data/compact/0@8712473450002685162/01EJEXEW6XQ37G17Q4JH9M2KF1; Compaction level 1; Labels: map[__org_id__:...]: 1/457844 series have an average of 1.000 out-of-order chunks: 0.000 of these are exact duplicates (in terms of data and time range)"
When this happens, compaction for the given user does not continue, because the compactor retries compacting this block over and over, failing each time.
Upon further investigation, this is a 2h block produced by an ingester. It's not clear why out-of-order chunks would be written; this is likely a bug in the Prometheus TSDB code.
Similar bugs in Thanos:
- https://github.com/thanos-io/thanos/issues/3442
- https://github.com/thanos-io/thanos/issues/267
A workaround is to rename the block in object storage so that it's not included in the compaction, for example:
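With blocks stored in S3, renaming can be done by copying the block under a prefix the compactor does not scan and then deleting the original objects. This is a rough sketch using the AWS CLI; the bucket name, tenant prefix and "corrupted" destination prefix are placeholders, and the block ULID is the one from the error above:
# copy the block out of the tenant's prefix, then delete the original objects
$ aws s3 cp --recursive s3://my-bucket/tenant-id/01EJEXEW6XQ37G17Q4JH9M2KF1/ s3://my-bucket/corrupted/tenant-id/01EJEXEW6XQ37G17Q4JH9M2KF1/
$ aws s3 rm --recursive s3://my-bucket/tenant-id/01EJEXEW6XQ37G17Q4JH9M2KF1/
Note that after the move the block is no longer visible to the store-gateway/querier either, so the data in it can no longer be queried.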
We recently hit this issue as well.
Could you paste the exact log error you've got?
caller=compactor.go:450 component=compactor msg="failed to compact user blocks" user=<redacted> err="compaction: group 0@16811904347059316647: block with not healthy index found /data/compactor/compact/0@16811904347059316647/01EPTTT4B3FXVQ5X7WX5XZA13K; Compaction level 1; Labels: map[org_id:<redacted>]: 1/1000000 series have an average of 1.000 out-of-order chunks: 0.000 of these are exact duplicates (in terms of data and time range)"
I am wondering, could this issue be caused by https://github.com/prometheus/prometheus/issues/8055? Because #8055 seems to introduce out-of-order samples.
It shouldn't. What you got is chunks out of order within the same block, while the issue you linked is about two different blocks overlapping in time.
@pracucci and @pstibrany, for this issue, do you think it would be an improvement if we changed the compactor not to halt the whole compaction process when a level-1 block is bad, given that there are replicas of that block?
Hi,
We also encountered this issue with the following message on our compactor:
Nov 18 16:29:56 cortex-compactor-1 cortex[8363]: level=error ts=2021-11-18T16:29:56.584539428Z caller=compactor.go:531 component=compactor msg="failed to compact user blocks" user=fake err="compaction: group 0@5679675083797525161: block with not healthy index found /var/lib/cortex/data/compact/0@5679675083797525161/01FM53C7H2SR3N111QZMA3TK8P; Compaction level 1; Labels: map[org_id:fake]: 13/20459573 series have an average of 1.000 out-of-order chunks: 1.538 of these are exact duplicates (in terms of data and time range)"
If it is any help, you can find the content of the not healthy block here : https://dl.plik.ovh/file/snWjOJ69TkrmdcUN/D8sbecutaDZBVAsN/01FM53C7H2SR3N111QZMA3TK8P.tar.gz
Regards,
Julien.
A better workaround for this issue is to mark the block as no-compact.
$ cat thanos.yml
type: S3
config:
  bucket: my-bucket
  endpoint: ...
  prefix: tenant-id
$ thanos tools bucket mark --id=$BLOCK --marker=no-compact-mark.json --objstore.config-file=thanos.yml --details=buggy
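For reference, the marker that the command uploads next to the block's meta.json (as <block-ulid>/no-compact-mark.json) looks roughly like this; the exact fields depend on the Thanos version, and the ULID and timestamp below are illustrative:
{
  "id": "01FM53C7H2SR3N111QZMA3TK8P",
  "version": 1,
  "details": "buggy",
  "no_compact_time": 1637251200,
  "reason": "manual"
}
Once the marker is in place, the compactor skips this block instead of halting the tenant's compaction, while the block itself stays in the bucket and remains queryable.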