Bug: Kafka health check in the plugin server produces to the healthcheck topic faster than it consumes from it

Open fuziontech opened this issue 3 years ago • 0 comments

Bug description

The consumer group lag for partition 0 of the healthcheck topic is slowly creeping up. It's not a deal breaker, but ideally the health check would do a round trip through Kafka with a consumer that keeps up, so the lag doesn't grow.

The Kafka health check routine: https://github.com/PostHog/posthog/blob/master/plugin-server/src/main/utils.ts#L51

Environment

[x] PostHog Cloud
[x] self-hosted PostHog (ClickHouse-based), version/commit: please provide
[x] self-hosted PostHog (Postgres-based, legacy), version/commit: please provide

Additional context

This isn't a huge problem, more of a metrics and hygiene issue with Kafka. Consumer group lag is a pretty standard way to check the health of the cluster and your app, and the growing backlog breaks that signal for at least one topic.
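
For reference, here is a minimal sketch of how per-partition consumer group lag is usually derived: the log end offset minus the group's committed offset. The `PartitionOffset` type and the `computeLag` function are hypothetical names made up for this example, not plugin-server code. The inputs are assumed to come from something like an admin client that reports offsets as strings. A partition with no committed offset is counted as fully unconsumed, which assumes the log starts at offset 0.

```typescript
// Hypothetical shape for illustration: one offset per partition, reported as a string.
interface PartitionOffset {
  partition: number;
  offset: string;
}

// Lag per partition = log end offset - committed offset.
function computeLag(
  endOffsets: PartitionOffset[],
  committedOffsets: PartitionOffset[],
): Map<number, number> {
  // Index the committed offsets by partition for quick lookup.
  const committed = new Map<number, number>();
  for (const { partition, offset } of committedOffsets) {
    committed.set(partition, Number(offset));
  }

  const lag = new Map<number, number>();
  for (const { partition, offset } of endOffsets) {
    const end = Number(offset);
    const current = committed.get(partition);
    // Assumption: a missing or negative committed offset means the group has
    // never committed, so the whole partition counts as lag (log starts at 0).
    const consumed = current === undefined || current < 0 ? 0 : current;
    lag.set(partition, Math.max(0, end - consumed));
  }
  return lag;
}
```

With this definition, a health check that produces more messages than its consumer commits shows up as a steadily growing value for partition 0.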

For comparison, here is what Klarna does for health checks with the kafkajs library: https://github.com/tulios/kafkajs/issues/452#issuecomment-517747429
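
A rough sketch of a health check that never produces to a topic, and so cannot grow lag, might look like the following. It is an illustration under assumptions and not necessarily identical to the linked approach. It assumes the timestamp of the last consumer heartbeat is recorded elsewhere (kafkajs emits a `consumer.events.HEARTBEAT` instrumentation event) and that the group state is fetched separately (for example via `consumer.describeGroup()`). The function name and parameters are hypothetical.

```typescript
// Consumer group states as reported by Kafka.
type GroupState =
  | "Stable"
  | "PreparingRebalance"
  | "CompletingRebalance"
  | "Dead"
  | "Empty"
  | "Unknown";

// Hypothetical helper: decides health from heartbeat recency and group state.
function isConsumerHealthy(
  lastHeartbeatMs: number | undefined,
  nowMs: number,
  sessionTimeoutMs: number,
  groupState: GroupState,
): boolean {
  // A heartbeat within the session timeout means the consumer is alive.
  if (lastHeartbeatMs !== undefined && nowMs - lastHeartbeatMs < sessionTimeoutMs) {
    return true;
  }
  // No recent heartbeat: still treat the consumer as healthy while the group
  // is rebalancing, since heartbeats can pause during a rebalance.
  return groupState === "PreparingRebalance" || groupState === "CompletingRebalance";
}
```

The trade-off is that this only shows the consumer is connected and participating in its group. It does not exercise a full produce and consume round trip.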


Jul 31 '22 16:07 fuziontech