[SUPPORT] Streaming improvements

Open VitoMakarevich opened this issue 2 years ago • 2 comments

Describe the problem you faced

Currently, the Hudi streaming source has limited features compared to well-known sources such as Apache Kafka. We needed this functionality, so we added it and have been running it for more than 6 months with no problems. The features we have on top of the current master:

  1. Support for Trigger.AvailableNow.
  2. Admission control (similar to Kafka's maxTriggerDelay, minOffsetsPerTrigger, etc.): lets you limit the number of rows processed in a single batch via a conf value. For this, we read the commit timeline and pack commits into a batch up to the given number of changed rows. The number of rows actually processed can be lower than the configured limit, because several commits may update the same rows.
  3. Metrics about this logic (such as batch delayed/started/initial run).
  4. Metrics about the backlog: how many rows are in the backlog (using the logic above) and how many batches remain to be run (based on rows/records per batch).
  5. Custom start/end time; currently only the start can be defined.
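
To illustrate items 2 and 4, here is a minimal sketch of how commits could be packed into batches under a row limit, and how backlog numbers could be derived from the same packing. This is not the actual implementation and not a Hudi API: `Commit`, `rows_changed`, `pack_commits` and `backlog` are hypothetical names, and the sketch assumes commits are never split, so a single commit larger than the limit forms its own batch.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Commit:
    """Hypothetical view of one commit on the timeline."""
    instant: str        # commit instant time (illustrative)
    rows_changed: int   # rows written by this commit (assumed to come from commit metadata)


def pack_commits(commits: List[Commit], max_rows: int) -> List[List[Commit]]:
    """Greedily group consecutive commits so each batch stays within max_rows.

    Commits are kept whole; a commit bigger than max_rows becomes a batch by itself.
    """
    batches: List[List[Commit]] = []
    current: List[Commit] = []
    current_rows = 0
    for commit in commits:
        # Close the current batch if adding this commit would exceed the limit.
        if current and current_rows + commit.rows_changed > max_rows:
            batches.append(current)
            current, current_rows = [], 0
        current.append(commit)
        current_rows += commit.rows_changed
    if current:
        batches.append(current)
    return batches


def backlog(commits: List[Commit], max_rows: int) -> Dict[str, int]:
    """Backlog metrics: total pending rows and the number of batches left to run."""
    return {
        "rows": sum(c.rows_changed for c in commits),
        "batches": len(pack_commits(commits, max_rows)),
    }


pending = [Commit("001", 40), Commit("002", 50), Commit("003", 30), Commit("004", 200)]
batches = pack_commits(pending, max_rows=100)
metrics = backlog(pending, max_rows=100)
```

The row counts here are upper bounds per batch: if two commits in the same batch update the same record, the batch yields fewer distinct rows than the sum of `rows_changed`.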

Given this, would the Hudi team be interested in these features? I can create a pull request.

VitoMakarevich, Dec 18 '23 13:12

@VitoMakarevich This would be a valuable addition to our streaming source. cc @nsivabalan @codope @yihua @xushiyan @danny0405

ad1happy2go, Dec 18 '23 16:12

Just to note: we are migrating our incremental query semantics to be based on completion time (https://github.com/apache/hudi/pull/10255). That change might be relevant.

danny0405, Dec 19 '23 03:12