hudi [SUPPORT] Streaming improvements

Describe the problem you faced

The Hudi streaming source currently has limited features compared to well-known sources such as Apache Kafka. We needed this functionality, implemented it, and have been running it for more than 6 months with no problems. The features we have on top of the current master:

Support for Trigger.AvailableNow.
Support for admission control (similar to Kafka maxTriggerDelay/minNumberRows/etc.), which lets you limit the number of rows processed in a single batch through a configuration value. For this, we read the commit timeline and pack commits into a batch based on the number of changed rows in each. Of course, the final number of rows in a batch may be lower than the configured limit, because some rows may have been updated more than once.
Metrics about this logic (such as batch delayed/started/initial run).
Metrics about the backlog, such as how many rows are in the backlog (using the logic above) and how many batches remain to be run (based on rows/records per batch).
Custom start/end time (currently only the start can be defined).
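
To make the admission-control and backlog ideas above concrete, here is a minimal sketch in Python. It is not the actual implementation: the `Commit` class, the `pack_commits` and `backlog_metrics` functions, and the `max_rows_per_batch` parameter are all hypothetical names, and the greedy packing policy (including giving an oversized commit its own batch) is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Commit:
    """Hypothetical view of one commit read from the timeline."""
    instant_time: str   # commit timestamp, assumed to sort chronologically as a string
    changed_rows: int   # number of rows the commit reports as written


def pack_commits(commits: List[Commit], max_rows_per_batch: int) -> List[List[Commit]]:
    """Greedily group consecutive commits so each batch stays within the row limit.

    A commit is never split. Assumption: a single commit larger than the
    limit is placed in a batch of its own so the stream can still progress.
    """
    if max_rows_per_batch <= 0:
        raise ValueError("max_rows_per_batch must be positive")

    batches: List[List[Commit]] = []
    current: List[Commit] = []
    current_rows = 0
    for commit in sorted(commits, key=lambda c: c.instant_time):
        # Close the current batch if adding this commit would exceed the limit.
        if current and current_rows + commit.changed_rows > max_rows_per_batch:
            batches.append(current)
            current, current_rows = [], 0
        current.append(commit)
        current_rows += commit.changed_rows
    if current:
        batches.append(current)
    return batches


def backlog_metrics(commits: List[Commit], max_rows_per_batch: int) -> Dict[str, int]:
    """Report the pending row count and the number of batches still to run."""
    return {
        "backlog_rows": sum(c.changed_rows for c in commits),
        "backlog_batches": len(pack_commits(commits, max_rows_per_batch)),
    }


pending = [
    Commit("001", 40),
    Commit("002", 50),
    Commit("003", 30),
    Commit("004", 200),
]
batches = pack_commits(pending, max_rows_per_batch=100)
metrics = backlog_metrics(pending, max_rows_per_batch=100)
```

With a limit of 100, commits 001 and 002 share a batch (90 changed rows), 003 starts a new one, and 004 exceeds the limit on its own and is isolated. The row counts here are an upper bound on what a batch actually reads, since a row updated in several commits is counted once per commit.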

Given this, would the Hudi team be interested? I can create a pull request.

Dec 18 '23 13:12 VitoMakarevich

@VitoMakarevich This can be a good value add to our streaming source. cc @nsivabalan @codope @yihua @xushiyan @danny0405

Dec 18 '23 16:12 ad1happy2go

Just to note: we are migrating our incremental query semantics to be based on completion time (https://github.com/apache/hudi/pull/10255). This change might be relevant.

Dec 19 '23 03:12 danny0405