[SUPPORT] Streaming improvements
Describe the problem you faced
As of now, the Hudi streaming source has limited features compared to well-known sources such as Apache Kafka. We needed this functionality, managed to add it, and have run it for more than 6 months with no problems. The features we have on top of the current master:
- Support for Trigger.AvailableNow.
- Support for admission control (similar to Kafka maxTriggerDelay/minNumberRows/etc.): it lets you limit the number of rows processed in a single batch through a configuration value. For this, we read the commit timeline and pack commits into a batch based on the number of rows each commit changed. Of course, the final row count can end up lower than the configured value, because some rows may have updates.
- Metrics about this logic (such as batch delayed/started/initial run).
- Metrics about the backlog, such as how many rows are in the backlog (using the logic above) and how many batches remain to be run (based on rows/records per batch).
- Custom start/end time: currently only `start` can be defined.
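
To make the admission-control and backlog ideas concrete, here is a minimal sketch of how commits could be packed into batches under a row limit. It is only an illustration under assumptions: `Commit`, `pack_commits`, `backlog`, and `max_rows` are hypothetical names, not Hudi APIs, and the greedy rule (always take at least one commit so the stream makes progress) is a guess at reasonable behavior rather than the actual implementation.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Commit:
    """Hypothetical view of one commit on the timeline."""
    instant: str       # commit instant time, sortable as a string
    rows_changed: int  # rows written by the commit (inserts + updates)


def pack_commits(commits: List[Commit], max_rows: int) -> List[List[Commit]]:
    """Greedily group consecutive commits so each batch stays within max_rows.

    A commit is never split. A single commit larger than max_rows still
    forms its own batch, otherwise the stream could never move past it.
    """
    batches: List[List[Commit]] = []
    current: List[Commit] = []
    current_rows = 0
    for commit in sorted(commits, key=lambda c: c.instant):
        # Close the current batch if adding this commit would exceed the limit.
        if current and current_rows + commit.rows_changed > max_rows:
            batches.append(current)
            current, current_rows = [], 0
        current.append(commit)
        current_rows += commit.rows_changed
    if current:
        batches.append(current)
    return batches


def backlog(commits: List[Commit], max_rows: int) -> Dict[str, int]:
    """Backlog metrics: pending changed rows and batches still to run."""
    return {
        "rows": sum(c.rows_changed for c in commits),
        "batches": len(pack_commits(commits, max_rows)),
    }


pending = [
    Commit("001", 40),
    Commit("002", 50),
    Commit("003", 30),
    Commit("004", 120),
    Commit("005", 10),
]
plan = pack_commits(pending, max_rows=100)
metrics = backlog(pending, max_rows=100)
```

With the sample data, the first two commits share a batch and every later commit lands in its own batch, including the oversized one. The row counts here are per-commit change counts, so the number of distinct rows a batch actually emits can be lower when the same row is updated in several commits.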
Given this - would the Hudi team be interested in this? I can create the Pull Request.
@VitoMakarevich This can be a good value add to our streaming source. cc @nsivabalan @codope @yihua @xushiyan @danny0405
Just to note: we are migrating our incremental query semantics to be based on completion time: https://github.com/apache/hudi/pull/10255. This change might be relevant.