Metrics for transform logging

Open oleiman opened this issue 2 years ago • 10 comments

This PR introduces some metrics for transform logging:

logger_probe for tracking metrics specific to individual transform loggers:

  • data_transforms_logger_events_total
    • Total # of log events emitted by some transform.
  • data_transforms_logger_events_dropped_total
    • Total # of some transform's log events that were dropped due to buffer capacity constraints.
    • exported to BOTH /metrics and /public_metrics

manager_probe for tracking metrics generic to the logging::manager:

  • data_transforms_log_manager_buffer_usage_ratio
    • Current occupancy of the logging::manager's queues as a fraction of total capacity. [0.0..1.0]
  • data_transforms_log_manager_write_errors_total
    • Total number of failures to produce log events to the transform logs topic.
    • exported ONLY to /metrics
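
To make the relationship between these metrics concrete, here is a minimal sketch of a bounded event buffer that tracks the same three quantities. The class `bounded_log_buffer` and its methods are hypothetical and are not Redpanda's actual `logger_probe` or `logging::manager` code. It also assumes that the events total counts every emitted event, including those later dropped, which the description above does not state explicitly.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <string>
#include <utility>

// Hypothetical stand-in for a per-transform log buffer (illustration only).
class bounded_log_buffer {
public:
    explicit bounded_log_buffer(std::size_t capacity)
      : _capacity(capacity) {
        assert(capacity > 0); // the ratio below divides by capacity
    }

    // Returns true if the event was enqueued, false if it was dropped.
    bool enqueue(std::string event) {
        ++_events_total; // assumption: counts every emitted event, dropped or not
        if (_queue.size() >= _capacity) {
            ++_events_dropped_total; // buffer full: drop and count it
            return false;
        }
        _queue.push_back(std::move(event));
        return true;
    }

    // Simulates a flush by removing up to n events; returns how many were removed.
    std::size_t drain(std::size_t n) {
        std::size_t removed = 0;
        while (removed < n && !_queue.empty()) {
            _queue.pop_front();
            ++removed;
        }
        return removed;
    }

    // Current occupancy as a fraction of total capacity, in [0.0, 1.0].
    double usage_ratio() const {
        return static_cast<double>(_queue.size())
               / static_cast<double>(_capacity);
    }

    std::uint64_t events_total() const { return _events_total; }
    std::uint64_t events_dropped_total() const { return _events_dropped_total; }

private:
    std::size_t _capacity;
    std::deque<std::string> _queue;
    std::uint64_t _events_total{0};
    std::uint64_t _events_dropped_total{0};
};
```

In this sketch, a dropped-events counter that keeps rising while the usage ratio sits near 1.0 would indicate that events arrive faster than the buffer is drained.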

Closes https://github.com/redpanda-data/core-internal/issues/1059

Backports Required

  • [ ] none - not a bug fix
  • [ ] none - this is a backport
  • [ ] none - issue does not exist in previous branches
  • [ ] none - papercut/not impactful enough to backport
  • [x] v23.3.x
  • [ ] v23.2.x
  • [ ] v23.1.x

Release Notes

Features

  • Add metrics for data transforms logging observability

oleiman avatar Feb 09 '24 20:02 oleiman

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/44956#018d8ff9-26a8-437a-b7fd-c030d809c4b5

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/45064#018daf99-bc28-43d6-a0f4-359f521e5ff9

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/45256#018dd2c1-1634-4257-b095-998374e6710d

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/45310#018dd791-0a37-47f6-a260-6fa6cbb5a7fc

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/45660#018e0c2f-6d7d-41a7-9e1d-91cbce916bbb

vbotbuildovich avatar Feb 09 '24 23:02 vbotbuildovich

new failures in https://buildkite.com/redpanda/redpanda/builds/44956#018d9005-98cf-4418-ad7f-2e328d86ee50:

"rptest.tests.partition_movement_test.SIPartitionMovementTest.test_cross_shard.num_to_upgrade=0.cloud_storage_type=CloudStorageType.S3"

new failures in https://buildkite.com/redpanda/redpanda/builds/45064#018daf88-57ac-4ba6-8598-bce02b094d94:

"rptest.tests.cpu_profiler_admin_api_test.CPUProfilerAdminAPITest.test_get_cpu_profile_with_override"

new failures in https://buildkite.com/redpanda/redpanda/builds/45302#018dd6af-b3df-49a3-bf18-a1d86ccef4d9:

"rptest.tests.data_transforms_test.DataTransformsLoggingMetricsTest.test_manager_metrics_values"

vbotbuildovich avatar Feb 09 '24 23:02 vbotbuildovich

CI Failures:

  • https://github.com/redpanda-data/redpanda/issues/16540

oleiman avatar Feb 10 '24 00:02 oleiman

force push contents:

  • remove unnecessary check on btree_map::emplace result at probe construction
  • try to account for the possibility of transform retries in tests
  • typos, etc.

oleiman avatar Feb 11 '24 00:02 oleiman

/ci-repeat 5 skip-unit dt-repeat=50 tests/rptest/tests/data_transforms_test.py::DataTransformsLoggingMetricsTest

oleiman avatar Feb 11 '24 00:02 oleiman

/ci-repeat 5 skip-unit dt-repeat=50 tests/rptest/tests/data_transforms_test.py::DataTransformsLoggingMetricsTest

oleiman avatar Feb 11 '24 00:02 oleiman

/dt

oleiman avatar Feb 11 '24 00:02 oleiman

/ci-repeat 1

oleiman avatar Feb 16 '24 00:02 oleiman

CI Failure:

  • https://github.com/redpanda-data/redpanda/issues/16618 (NEW)
    • doesn't seem related to this PR (passed on the release build)

oleiman avatar Feb 16 '24 03:02 oleiman

/ci-repeat 5 skip-unit skip-redpanda-build dt-repeat=50 tests/rptest/tests/data_transforms_test.py::DataTransformsLoggingMetricsTest

oleiman avatar Feb 16 '24 03:02 oleiman

force push contents:

  • change 'transform_name' label to 'function_name' (to match other transform metrics, as in transform/probe.cc)
  • use the convenient add_group interface for internal metrics and remove aggregation config checks
  • remove redundant/dead code

oleiman avatar Feb 22 '24 20:02 oleiman

/ci-repeat 5 skip-unit skip-redpanda-build dt-repeat=50 tests/rptest/tests/data_transforms_test.py::DataTransformsLoggingMetricsTest

oleiman avatar Feb 23 '24 15:02 oleiman

force push contents:

  • Move probe deinit to log_manager::stop. As a consequence, move some other member init/deinit to start/stop. @BenPope - good call out in standup.

oleiman avatar Feb 23 '24 17:02 oleiman

force push contents: fix broken unit test (log_manager::stop should be idempotent)
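
As a rough illustration of what an idempotent stop looks like, the sketch below tears down a probe only on the first call to `stop()` and treats later calls as no-ops. The names `probe_stub` and `log_manager_sketch` are hypothetical. The real `log_manager` is asynchronous and is not reproduced here.

```cpp
#include <memory>

// Hypothetical placeholder for a metrics probe (illustration only).
struct probe_stub {};

class log_manager_sketch {
public:
    // Creates the probe on start; calling start twice keeps the existing probe.
    void start() {
        if (!_probe) {
            _probe = std::make_unique<probe_stub>();
        }
    }

    // Safe to call repeatedly, and safe to call before start().
    void stop() {
        if (!_probe) {
            return; // already stopped or never started: nothing to do
        }
        _probe.reset(); // probe deinit happens exactly once per start
        ++_teardowns;
    }

    bool running() const { return _probe != nullptr; }
    int teardowns() const { return _teardowns; }

private:
    std::unique_ptr<probe_stub> _probe;
    int _teardowns{0};
};
```

Guarding on the probe's presence rather than on a separate flag keeps the state and the resource in sync, so a second `stop()` cannot deinitialize anything twice.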

oleiman avatar Feb 23 '24 19:02 oleiman

/cdt num_nodes=5 dt-repeat=10 tests/rptest/tests/data_transforms_test.py::DataTransformsLoggingMetricsTest

oleiman avatar Feb 26 '24 18:02 oleiman

force push contents: change counters from uint32_t to uint64_t
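
The thread does not state the motivation, but one plausible reason (an assumption) is that a 32-bit counter wraps to zero after 4,294,967,295 events, while a 64-bit counter does not in practice. The helper `bump` below is hypothetical and only demonstrates the wraparound behavior of unsigned arithmetic.

```cpp
#include <cstdint>
#include <limits>

// Adds delta to an unsigned counter of type T.
// Unsigned arithmetic wraps modulo 2^N, where N is the bit width of T.
template <typename T>
T bump(T counter, T delta) {
    return static_cast<T>(counter + delta);
}

// Largest value a 32-bit counter can hold.
constexpr std::uint32_t max32 = std::numeric_limits<std::uint32_t>::max();
```

With these definitions, `bump<std::uint32_t>(max32, 1)` wraps around to 0, whereas `bump<std::uint64_t>(max32, 1)` yields 4294967296.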

oleiman avatar Feb 28 '24 05:02 oleiman

force push contents: import cleanup, and produce even less in the _values test to be safe.

oleiman avatar Mar 05 '24 00:03 oleiman

/backport v23.3.x

vbotbuildovich avatar Mar 05 '24 05:03 vbotbuildovich

Failed to create a backport PR to v23.3.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-16566-v23.3.x-647 remotes/upstream/v23.3.x
git cherry-pick -x c31ea7f3c9adcc395e7b025dbd3c71d0c916a491 6d59f927053f8117bc14bddf273ef4b15773c3c1 b3a663798f0ce6c14d20805251ca3aa20316e808

Workflow run logs.

vbotbuildovich avatar Mar 05 '24 05:03 vbotbuildovich