
[SUPPORT][metadata] Slow performance when inserting to a lot of partitions and metadata enabled

Open VitoMakarevich opened this issue 2 years ago • 6 comments


Describe the problem you faced

Recently we enabled the metadata table for one big table. This table has ~700k partitions in the test environment, and a typical insert touches ~600 of them. It is also worth noting that the problem is even more pronounced with the timeline server enabled. The symptom is a slow "getting small files" stage: for 600 insert-affected partitions (600 tasks in this stage) it takes ~2-3 minutes, and for 12k tasks (partitions affected) ~40 minutes. Important details:

  1. Metadata is enabled
  2. Timeline server is enabled
  3. Metadata table HFile size: about 40 MB
  4. Metadata log files kept to a minimum: "hoodie.metadata.compact.max.delta.commits" = "1"
  5. Only file listing is enabled in the metadata table

I dug into this a lot and enabled detailed logs, and this is what I think may be the issue. In the logs, "Metadata read for %s keys took [baseFileRead, logMerge] %s ms" shows half of the values as single-digit numbers and half above 1000, e.g. Metadata read for 1 keys took [baseFileRead, logMerge] [0, 12456] ms. There are also logs like Updating metadata metrics (basefile_read.totalDuration=12155) and Updating metadata metrics (lookup_files.totalDuration=12156). So my suspicion is that, given such a large metadata file (40 MB), the HFile lookup is suboptimal. Are you aware of any similar issue (in 0.12.1 and 0.12.2)? As I see in the code, it seeks up to a partition path; could it be that readers are somehow not reused, so these 40 MB files are seeked again and again thousands of times? As a workaround we turned off the embedded timeline server; as I understand it, the same work is then done on the executors, which is less problematic since the parallelism is much higher. The downside is the marker listing logic.
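For reference, a minimal Spark (Scala) writer sketch matching the setup described above; it assumes an existing DataFrame named inputDf, and the table name, record key, partition column, and S3 path are placeholders. Only the options called out in this issue are spelled out.

```scala
import org.apache.spark.sql.SaveMode

// Minimal sketch of the setup described above (placeholder names and path):
// metadata table on, metadata compacted every delta commit, embedded timeline server on.
inputDf.write
  .format("hudi")
  .option("hoodie.table.name", "big_table")                               // placeholder
  .option("hoodie.datasource.write.recordkey.field", "id")                // placeholder
  .option("hoodie.datasource.write.partitionpath.field", "partition_col") // placeholder
  .option("hoodie.datasource.write.operation", "insert")
  .option("hoodie.metadata.enable", "true")                    // 1. metadata enabled
  .option("hoodie.embed.timeline.server", "true")              // 2. timeline server enabled
  .option("hoodie.metadata.compact.max.delta.commits", "1")    // 4. from the list above
  .mode(SaveMode.Append)
  .save("s3://bucket/path/to/big_table")                       // placeholder
```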

To Reproduce

Steps to reproduce the behavior:

Generating a large set of partitions (enough for a 30-40 MB HFile in the metadata table) and running an insert into thousands of partitions will probably reproduce it.
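A rough, unverified sketch of such a reproduction (Scala, synthetic data, placeholder table name and path):

```scala
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

val spark = SparkSession.builder().appName("hudi-mdt-repro").getOrCreate()
import spark.implicits._

def insert(df: DataFrame): Unit =
  df.write.format("hudi")
    .option("hoodie.table.name", "mdt_repro")
    .option("hoodie.datasource.write.recordkey.field", "id")
    .option("hoodie.datasource.write.partitionpath.field", "part")
    .option("hoodie.datasource.write.operation", "insert")
    .option("hoodie.metadata.enable", "true")
    .option("hoodie.metadata.compact.max.delta.commits", "1")
    .mode(SaveMode.Append)
    .save("s3://bucket/tmp/mdt_repro")

// Seed a table with a very large number of partitions so the metadata
// table's files partition compacts into a large HFile.
val seed = (0 until 700000).map(i => (i.toLong, s"p_$i", "x")).toDF("id", "part", "payload")
insert(seed)

// Then insert into thousands of existing partitions and check the duration
// of the "getting small files" stage in the Spark UI.
val incr = (0 until 12000).map(i => (1000000L + i, s"p_$i", "y")).toDF("id", "part", "payload")
insert(incr)
```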

Expected behavior

The "getting small files" stage should not take this long. Without the metadata table and the embedded timeline server it takes ~40 seconds to check 12k partitions, while with metadata enabled it takes 40+ minutes.
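For comparison, the workaround mentioned above (disabling the embedded timeline server so the file-system view is resolved on the executors) comes down to one option; a sketch, using the same placeholder writer as before:

```scala
// Sketch of the workaround described above: keep the metadata table on,
// but turn off the embedded timeline server (at the cost of the marker
// listing downside mentioned earlier).
inputDf.write
  .format("hudi")
  .option("hoodie.metadata.enable", "true")
  .option("hoodie.embed.timeline.server", "false")
  // ...remaining writer options as in the sketch above...
  .mode(org.apache.spark.sql.SaveMode.Append)
  .save("s3://bucket/path/to/big_table")
```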

Environment Description

  • Hudi version : 0.12.1-0.12.2

  • Spark version : 3.3.0-3.3.1

  • Hive version :

  • Hadoop version : 3.3.3

  • Storage (HDFS/S3/GCS..) : S3

  • Running on Docker? (yes/no) : yes

Additional context

I can try to put together a reproduction if you are not aware of anything like this.


VitoMakarevich avatar Oct 11 '23 13:10 VitoMakarevich

I tried to change the settings to

        "hoodie.hfile.max.file.size" = "45829120"
        "hoodie.hfile.block.size" = "548576"
        "hoodie.hfile.compression.algorithm" = "snappy"

It looked fast before the first compaction happened: while I had only a couple of .log files, a lookup of ~1200 partitions in a ~60 MB file took seconds. But after the first compaction it became even slower.
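For completeness, a sketch of how those storage settings (option keys copied verbatim from the block above) would be passed as writer options alongside the metadata config already in use; this is only an illustration, not a recommendation.

```scala
// Sketch only: the HFile tuning from the block above passed as writer options,
// combined with the metadata settings already in use.
inputDf.write
  .format("hudi")
  .option("hoodie.metadata.enable", "true")
  .option("hoodie.metadata.compact.max.delta.commits", "1")
  .option("hoodie.hfile.max.file.size", "45829120")
  .option("hoodie.hfile.block.size", "548576")
  .option("hoodie.hfile.compression.algorithm", "snappy")
  // ...remaining writer options as before...
  .mode(org.apache.spark.sql.SaveMode.Append)
  .save("s3://bucket/path/to/big_table")
```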

VitoMakarevich avatar Oct 13 '23 10:10 VitoMakarevich

As I understand it, it works fast until the first HFile base file appears. I checked and the data is about the same size either way, but while it is still a log file it is cached in an internal spillable map; once it becomes an HFile, the lookup goes through HBase code, and somewhere in either Hudi's code or HBase's code it becomes slow.

VitoMakarevich avatar Oct 13 '23 17:10 VitoMakarevich

@alexeykudinkin I saw you were touching the log blocks reader; can you check whether the HFile integration you have shows the same performance?

VitoMakarevich avatar Oct 13 '23 17:10 VitoMakarevich

So I debugged it to the point where I'm 99% confident that the issue is a suboptimal lookup in the HBase HFile. In the smaller environment it is about 400 MB uncompressed (stored as 40 MB compressed), and it takes seconds just to find a partition's files in it.

VitoMakarevich avatar Oct 13 '23 18:10 VitoMakarevich

Confirmed to be the same in 0.13.1...

VitoMakarevich avatar Oct 25 '23 15:10 VitoMakarevich

@VitoMakarevich A data lake design can't handle this many partitions; the partition column has to be less granular. As mentioned, the table has 700K partitions, which causes a lot of metadata overhead, so that column should not be used as the partition column.
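A hedged illustration of that advice (column names are hypothetical): derive a coarser partition key, for example a day-level date, instead of partitioning on a high-cardinality value.

```scala
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions._

// Hypothetical example: collapse a high-cardinality value into a bounded,
// day-level partition key before writing, so the table ends up with hundreds
// or a few thousand partitions rather than ~700K.
val coarsened = inputDf.withColumn("partition_date", date_format(col("event_ts"), "yyyy-MM-dd"))

coarsened.write.format("hudi")
  .option("hoodie.datasource.write.partitionpath.field", "partition_date")
  // ...other writer options unchanged...
  .mode(SaveMode.Append)
  .save("s3://bucket/path/to/big_table")
```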

ad1happy2go avatar Oct 17 '24 12:10 ad1happy2go

@VitoMakarevich Closing this issue, as this behavior is expected. Feel free to reopen if you think otherwise.

ad1happy2go avatar Nov 07 '24 16:11 ad1happy2go