[FEATURE REQUEST] Nested idempotency support
Hi Community,
I am seeking guidance on handling nested idempotency for a large-scale data scenario involving contracts with third-party vendors for transferring items. Each contract (identified by contractId) has around 100,000 items (identified by itemId), and we receive a total of 6 million contracts per month, growing by 50% yearly. I want to use contractId as the hoodie_record_key and store the list of items as a nested field. All items within a contract share the same contract-level attributes. In the future, items may be added, deleted, or updated for a given contractId, requiring me to fetch the item array and update the necessary items. While I understand Hudi doesn't natively support deduplicating items within an array, I'm looking for a configuration-driven approach that might be useful for many projects. However, I acknowledge that updating nested fields could have performance implications as the number of items per contract grows.
Is this planned as part of Hudi's future goals?
Thanks.
@bibhu107 Why can't we achieve this with current functionality? You can preprocess your DataFrame with something like groupBy and collect_list and then save to Hudi. You can further implement a custom payload to merge the lists (previous and current) however you want.
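To illustrate the custom-payload idea, here is a minimal sketch of the merge logic in plain Python. In a real Hudi deployment this logic would live in a custom RecordPayload implementation (Java/Scala), not in Python; the function name and item shape below are hypothetical.

```python
def merge_item_lists(previous, current):
    """Merge two lists of item dicts, keyed by itemId.

    Items present in `current` overwrite matching items in `previous`;
    items only in `previous` are kept. A real Hudi custom payload would
    do this inside combineAndGetUpdateValue on the nested array field.
    """
    merged = {item["itemId"]: item for item in previous}
    for item in current:
        merged[item["itemId"]] = item  # latest version of the item wins
    return sorted(merged.values(), key=lambda i: i["itemId"])


# hypothetical example records for one contractId
prev = [{"itemId": 1, "qty": 5}, {"itemId": 2, "qty": 3}]
curr = [{"itemId": 2, "qty": 7}, {"itemId": 3, "qty": 1}]
print(merge_item_lists(prev, curr))
```

Deletions would need an extra marker (e.g. a tombstone flag per item), since a plain merge like this can only add or overwrite.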
That said, since each contract id has 100,000 items, a nested structure would make even a single record payload huge and performance will be very bad. A list that big may not even fit in the JVM, and the job can fail. Why not use a denormalised structure with contract_id and item_id as the record key?
@bibhu107 Let us know in case you have further questions on this or any confusion about the above. Feel free to comment if I have not understood the issue correctly. Thanks.
Thanks for your suggestions, @ad1happy2go.
- Even right now we are doing groupBy and collect_list, and this fails when the array size exceeds 2 GB.
- As you can see, all items carry similar data because they belong to the same contract, so denormalising by item_id could lead to a lot of duplicated contract details across items and data sets.
The reason I am approaching Hudi to solve this is that a simple groupBy and collect_list is not working; it would be more useful if we could smartly index into the array to find the item that needs to be updated.
@bibhu107 That would not be straightforward to implement. In my view, a design that keeps a list of 100,000 items in a single record is itself not right. It is not scalable at all; the job is simply going to fail.
If you are concerned about data redundancy, why not create two tables: one containing contract details and the other containing item details (with the contract id). In the second table, contract id and item id together can form the record key.
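For the two-table route, the Hudi write options for the item table with a composite record key might look like the sketch below. The table and field names are assumptions; the option keys themselves are standard Hudi datasource options, and a multi-field record key uses the ComplexKeyGenerator.

```python
# Hudi write options for the item table (sketch; field names assumed).
# A composite record key (contract_id + item_id) requires the
# ComplexKeyGenerator, which accepts comma-separated key fields.
item_table_opts = {
    "hoodie.table.name": "contract_items",
    "hoodie.datasource.write.recordkey.field": "contract_id,item_id",
    "hoodie.datasource.write.keygenerator.class":
        "org.apache.hudi.keygen.ComplexKeyGenerator",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.partitionpath.field": "contract_month",
    "hoodie.datasource.write.operation": "upsert",
}

# Usage (requires Spark with the Hudi bundle on the classpath):
# item_df.write.format("hudi").options(**item_table_opts) \
#        .mode("append").save("s3://bucket/warehouse/contract_items")
```

With this layout, updating one item touches one small record instead of rewriting a 100,000-element array, and the contract-level attributes live once per contract in the first table.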
@ad1happy2go Thanks for the suggestion. This makes sense; I was thinking along the same lines of using two different tables for this.