[SUPPORT] Querying Hudi tables with Spark+Velox(C++), ObjectSizeCalculator.getObjectSize hangs causing about a 50-second delay in queries
Describe the problem you faced
When I query Hudi tables using Spark+Velox, I encounter a timeout error when it gets to ObjectSizeCalculator.getObjectSize.
The problem appears after enabling Velox. While ServiceabilityAgentSupport is being initialized, the needSudo method fails to obtain a result: it hangs for about 50 seconds, ends with a timeout error, and then a default singleton is returned. This happens on the driver and on each executor of my Spark cluster the first time they start up. With Velox disabled, the same method usually completes in just over a second.
Although this problem can currently be circumvented by setting jol.skipHotspotSAAttach=true, its occurrence is unexpected. Is anyone aware of the cause of this problem? Could there be an incompatibility issue between the method called and Velox? Is a rollback necessary?
WARNING INFORMATION:
# WARNING: Unable to attach Serviceability Agent. You can try again with escalated privileges. Two options: a) use -Djol.tryWithSudo=true to try with sudo; b) echo 0 | sudo tee /proc/sys/kernel/yama/ptrace_scope
24/01/26 11:40:04 INFO HoodieBackedTableMetadata: Opened 1 metadata log files (dataset instant=20240103095853633, metadata instant=20231215140043588001) in 48126 ms
To Reproduce
Steps to reproduce the behavior:
- query Hudi tables using Spark+Velox
Expected behavior
Environment Description
- Hudi version : 0.14.0
- Spark version : 3.3
- Running on Docker? (yes/no) : yes
Additional context
Stacktrace
java.lang.Object.wait(Native Method)
java.lang.Object.wait(Object.java:502)
java.lang.UNIXProcess.waitFor(UNIXProcess.java:395)
org.apache.hudi.org.openjdk.jol.vm.sa.ServiceabilityAgentSupport.callAgent(ServiceabilityAgentSupport.java:190)
org.apache.hudi.org.openjdk.jol.vm.sa.ServiceabilityAgentSupport.needSudo(ServiceabilityAgentSupport.java:109)
org.apache.hudi.org.openjdk.jol.vm.sa.ServiceabilityAgentSupport.<init>(ServiceabilityAgentSupport.java:88)
org.apache.hudi.org.openjdk.jol.vm.sa.ServiceabilityAgentSupport.instance(ServiceabilityAgentSupport.java:77)
org.apache.hudi.org.openjdk.jol.vm.VM.current(VM.java:77)
org.apache.hudi.org.openjdk.jol.info.GraphWalker.walk(GraphWalker.java:97)
org.apache.hudi.org.openjdk.jol.info.GraphLayout.parseInstance(GraphLayout.java:54)
org.apache.hudi.common.util.ObjectSizeCalculator.getObjectSize(ObjectSizeCalculator.java:57)
@majian1998 It looks like we hit the same problem in #10504, although we are not using Velox.
@KnightChess It seems like you encountered this issue just once and got stuck for a long time, right? On my end, I can consistently reproduce the problem, but it only gets stuck for 50 seconds. T-T
@majian1998 Did this issue start after upgrading to 0.14.0, or did it also happen with older Hudi versions?
@ad1happy2go I understand that the issue started when the PR [HUDI-4687] introduced the use of jol to estimate object size.
Interesting. We ran a micro-benchmark and found about a 5% slowdown due to JOL. Since we invoke this for only a sample of records and not for every record in the batch, we did not consider other alternatives (as mentioned in the PR description). The main reason JOL was added is that Trino upgraded to Java 17 and the trino-hudi connector build started failing (the reason is given in the PR).
I am curious whether something else is going on: object size calculation lies on the hot path, so this issue would have surfaced in the other large-scale benchmarks that we run before release.
@codope In the scenario I described (jdk8+spark3.3+velox), there was a significant delay. Regardless of how long the query took to execute, it would always time out and result in an error after about 45 to 60 seconds. This can be critical for queries that are supposed to finish within a few minutes. However, when I turn off velox, there are no warnings at all.
Reading the code shows that this issue can be bypassed with JVM parameters, but that imposes extra learning overhead on users.
I suspect there might be a compatibility issue between the JOL component and velox?
Hudi 0.14.0 + Velox, same problem, -Djol.skipHotspotSAAttach=true works! Thanks!
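For anyone applying the same workaround: the hang is reported on the driver and on every executor, so the flag presumably has to reach each of those JVMs. Below is a minimal sketch, not taken from Hudi or JOL, that prints the spark-submit arguments for this. The class name JolWorkaround and its helper are hypothetical; the property name comes from this thread, and spark.driver.extraJavaOptions / spark.executor.extraJavaOptions are the standard Spark settings for passing JVM options.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, for illustration only; not part of Hudi, JOL, or Spark.
public class JolWorkaround {
    // System property named in this thread as the workaround.
    static final String FLAG = "jol.skipHotspotSAAttach";

    /** Spark settings that forward the flag to the driver and executor JVMs. */
    static Map<String, String> sparkConf() {
        String opt = "-D" + FLAG + "=true";
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put("spark.driver.extraJavaOptions", opt);
        conf.put("spark.executor.extraJavaOptions", opt);
        return conf;
    }

    public static void main(String[] args) {
        // Print the settings in the form accepted on the spark-submit command line.
        sparkConf().forEach((k, v) -> System.out.println("--conf " + k + "=" + v));
    }
}
```

Passing these on the spark-submit command line (or in spark-defaults.conf) applies them before the JVMs start. Setting the property from application code would only affect the JVM that runs that code, and only if it happens before JOL is first used.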