[SUPPORT] Is it possible to read/write Hudi files with another programming language?

Open schlichtanders opened this issue 3 years ago • 13 comments

Hi,

I am curious about the state of Hudi. We are currently using it via Spark, but we are thinking about switching to another language.

Is it possible to write Hudi files via C, C++, Rust, or anything? Or is it completely tied to Spark/Flink?

Thank you very much for your help

schlichtanders avatar Dec 13 '22 13:12 schlichtanders

Not yet, but it's planned for version 1.0.0. https://hudi.apache.org/roadmap/

Currently, one can use Hudi with Python (pyspark), Java and Scala.

codope avatar Dec 13 '22 13:12 codope

Thank you for the pointer to the roadmap. A C or Rust implementation would be nice for the entire LLVM ecosystem. I am looking forward to using Julia together with Hudi some day. (Julia also compiles via LLVM, so a C binding would be optimal.)

As 1.0.0 may still be far in the future, is the Java API also accessible outside of Apache Spark? I mean as a pure Java library, which could be loaded from other languages.

schlichtanders avatar Dec 15 '22 10:12 schlichtanders

Hi @schlichtanders, Hudi has a pure Java API for writing tables through HoodieJavaWriteClient. You can check the examples in HoodieJavaWriteClientExample.

I'll close this issue for now. Feel free to reopen the issue if you have more questions.

yihua avatar Dec 22 '22 20:12 yihua

@yihua is there also a ReadClient? An example would also be great.

schlichtanders avatar May 16 '23 09:05 schlichtanders

@yihua

schlichtanders avatar May 31 '23 15:05 schlichtanders

Hudi is certainly lagging behind in native support for other languages. Iceberg and Delta already have some pretty nice libraries, such as delta-rs and pyiceberg, for reading and writing files without a JVM.

cheunhong avatar Jan 23 '24 18:01 cheunhong

Thank you @cheunhong. I agree, and it is a pity. Hudi's support for streaming is super attractive to me. Neither delta-rs nor Iceberg has it, as far as I know...

schlichtanders avatar Jan 24 '24 09:01 schlichtanders

@schlichtanders @cheunhong I missed this discussion. We are considering different language support. If you have a use case I’d love to chat with you about that and see how the use case can be better supported.

We have an experimental PR on read support in Python: #8768. We have also introduced a Hudi file group reader to make read integration in engines easier.

yihua avatar Mar 09 '24 23:03 yihua

For me, Python is actually not the problem: via Spark and Flink it is pretty well supported.

My use case is to use the modern programming language Julia directly, without the JVM in between, because the language itself is highly performant and has distributed computing support. It is a perfect match for working with Hudi, both for big data and for streaming. Hence it would be great if Hudi were also accessible without Spark and Flink, i.e. without the JVM.

schlichtanders avatar Mar 11 '24 08:03 schlichtanders

I was looking into a Rust implementation because of the work happening on pg_analytics by ParadeDB, where they had to choose delta-rs purely because they depend on Rust tooling to create the Postgres extension. The use case here is that, theoretically, if you integrate Hudi (or, as they are doing, Delta Lake) as a Postgres extension, you can offload data directly onto your data lake, which makes the transition to a lakehouse architecture much easier and avoids external ETL tooling.

A lot of the OSS work being done by Materialize.com, Neon.tech, and DataBend is happening in Rust, so if Hudi could integrate with the modern development happening in Rust, I imagine it could be a big win for the ecosystem.

rubenatterbury avatar Mar 26 '24 14:03 rubenatterbury

@xushiyan do you want to share the budding hudi-rs and Python bindings here, to see if anyone wants to chip in with contributions?

vinothchandar avatar May 02 '24 16:05 vinothchandar

https://github.com/xushiyan/hudi-rs has some basic reads with datafusion?

vinothchandar avatar May 02 '24 16:05 vinothchandar

@vinothchandar yes. gonna take care of repo logistics and dev setup to make the repo ready for new contributors. Also preparing issues to work on.

xushiyan avatar May 02 '24 21:05 xushiyan

@rubenatterbury @schlichtanders @cheunhong we have officially released hudi-rs 0.1.0! https://github.com/apache/hudi-rs

xushiyan avatar Jul 19 '24 05:07 xushiyan