[CI] Enable Java test in CI workflow
This PR enables the Java build and tests in the CI workflow.
Some scripts modified here also appear in PR #831. Once 831 is merged, I’ll rebase and make sure everything stays consistent.
@rhdong could you please put this PR into draft until you're ready for reviews? That'd reduce the notifications reviewers are getting, and help them understand when it's time to come review.
Thanks for the reminder! I’ve marked the PR as draft now.
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.
Contributors can view more details about this message here.
/ok to test
/ok to test
/ok to test
/ok to test
/ok to test
/ok to test
This pull request requires additional validation before any workflows can run on NVIDIA's runners.
Pull request vetters can view their responsibilities here.
Contributors can view more details about this message here.
/ok to test
/ok to test
@rhdong, there was an error processing your request: E1
See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/
/ok to test 79a477ece8e32f75de95071cc8b1403e7eafbbfc
The reason for the Java build failure is that the script could not find jextract (needed for generating the Panama bindings before the Java build and tests). The CI servers need jextract preinstalled from https://jdk.java.net/jextract
@cjnolet @rhdong
@narangvivek10 @rhdong @cjnolet I've committed a fix [0] to download jextract automatically if it's not already installed. The reason for doing this is that jextract doesn't have a .deb or apt package for Ubuntu, so the download of jextract needs to be scripted anyway.
[0] - https://github.com/rapidsai/cuvs/pull/831/commits/570fa2a7a792b39cb70c4ff1232661481ba8ecaa in https://github.com/rapidsai/cuvs/pull/831
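For reference, a minimal sketch of what such an auto-download step could look like (JEXTRACT_URL is a placeholder for the actual Linux build URL from https://jdk.java.net/jextract; the linked commit is authoritative):

```bash
# Hypothetical sketch: fetch jextract when it is not already on the PATH.
# JEXTRACT_URL is a placeholder; builds are published via https://jdk.java.net/jextract
if ! command -v jextract >/dev/null 2>&1; then
  curl -fsSL -o jextract.tar.gz "${JEXTRACT_URL}"
  tar -xzf jextract.tar.gz                 # unpacks into a jextract-22/ directory
  export PATH="${PWD}/jextract-22/bin:${PATH}"
fi
```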
/ok to test 04f0bba69f7bcc7968c2f62a9d04520b083dedc9
@narangvivek10 The jextract process failed due to:
c_api.h:19:10: error: 'cuda_runtime.h' file not found
fatal: Unexpected exception org.openjdk.jextract.clang.TypeLayoutError: Invalid. segment: org.openjdk.jextract.clang.Type@cc813a2e, fieldName: n_probes occurred
Jextract encountered issues (returned value 5)
Bindings generation did not complete normally (returned value 5)
Forcing this build process to abort
Any ideas where cuda_runtime.h can be found?
@rhdong We have attempted to find the CUDA_HOME dir as:
CUDA_HOME=$(which nvcc | cut -d/ -f-4)
And then tried to add the $CUDA_HOME/include dir to the include paths. Any ideas if this was the problem and is there a better way?
Also, I see the following:
2025-04-24T16:03:41.9414077Z Forcing this build process to abort
2025-04-24T16:03:41.9513171Z
2025-04-24T16:03:41.9516434Z RAPIDS logger » [04/24/25 16:03:41]
2025-04-24T16:03:41.9517878Z ┌─────────────────────────────────────────────────────────────────────────────┐
2025-04-24T16:03:41.9519658Z | Initial Java build & test failed. Retrying with 'mvn clean verify -X' |
2025-04-24T16:03:41.9521203Z └─────────────────────────────────────────────────────────────────────────────┘
2025-04-24T16:03:41.9522013Z
I think this retry is neither necessary nor correct, since the failure here occurred in a step before Maven is even invoked (the failure is in the generate-bindings.sh file). Because of the retry, the logs are polluted with a lot of symbol-not-found errors from Maven, masking the original problem: the Panama bindings were not properly generated.
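As a rough sketch of how the retry could be scoped (script and goal names are from this thread; the exact step layout is an assumption):

```bash
# Hypothetical sketch: fail fast when bindings generation breaks, and only
# retry with Maven debug output if the Maven build itself fails.
./generate-bindings.sh || { echo "Bindings generation failed; aborting without Maven retry" >&2; exit 1; }
mvn clean verify || mvn clean verify -X
```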
@rhdong I've made the following changes:
- Debug printing of the CUDA_HOME variable, and the contents of the $CUDA_HOME/include
- If there's no include dir inside CUDA_HOME, fall back to CUDA_HOME=/usr/local/cuda (sketched below)
https://github.com/rapidsai/cuvs/pull/831/files/570fa2a7a792b39cb70c4ff1232661481ba8ecaa..306229d29b0123bc7f6e72adca6e7d155047f528
I'm hoping it will make things work. Can you please merge that and retest here?
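In outline, the fallback looks roughly like this (a sketch; the committed script in the diff linked above is authoritative):

```bash
# Hypothetical sketch of the CUDA_HOME fallback described above
CUDA_HOME=$(which nvcc | cut -d/ -f-4)
echo "CUDA_HOME=${CUDA_HOME}"
ls "${CUDA_HOME}/include" 2>/dev/null || true
if [ ! -d "${CUDA_HOME}/include" ] && [ -d /usr/local/cuda/include ]; then
  CUDA_HOME=/usr/local/cuda   # conventional CUDA toolkit install location
fi
```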
> @rhdong We have attempted to find the CUDA_HOME dir as:
> CUDA_HOME=$(which nvcc | cut -d/ -f-4)
> And then tried to add the $CUDA_HOME/include dir to the include paths. Any ideas if this was the problem and is there a better way?
Hi @chatman @narangvivek10, the docker image is rapidsai/ci-conda:latest, and the CUDA includes are installed when creating the conda env named test. In my local experiment, cuda_runtime.h is at /opt/conda/envs/test/targets/x86_64-linux/include/cuda_runtime.h. So I fixed it in the top commit, and a new error comes up:
jextract-22/bin/jextract.ps1
jextract downloaded to /cuvs/java/jextract-22
common.h:21:10: error: 'dlpack/dlpack.h' file not found
fatal: Unexpected exception org.openjdk.jextract.clang.TypeLayoutError: Invalid. segment: org.openjdk.jextract.clang.Type@1c99c732, fieldName: addr occurred
Jextract encountered issues (returned value 5)
Bindings generation did not complete normally (returned value 5)
Forcing this build process to abort
RAPIDS logger » [04/24/25 20:26:13]
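For reference, the cuda_runtime.h part of the fix can be expressed roughly like this (a sketch; the CONDA_PREFIX fallback is an assumption, and the path matches the one observed above):

```bash
# Hypothetical sketch: point at the CUDA headers inside the conda test env.
CUDA_INCLUDE="${CONDA_PREFIX:-/opt/conda/envs/test}/targets/x86_64-linux/include"
if [ ! -f "${CUDA_INCLUDE}/cuda_runtime.h" ]; then
  echo "cuda_runtime.h not found in ${CUDA_INCLUDE}" >&2
fi
```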
> Also, I see the following: ... I think this retry is neither necessary nor correct, since the failure here occurred in a step before Maven is even invoked (the failure is in the generate-bindings.sh file). Because of the retry, the logs are polluted with a lot of symbol-not-found errors from Maven, masking the original problem: the Panama bindings were not properly generated.
Yeah, I agree. The main goal of the retry was to debug the HNSW issue (which has been resolved), and the retry only happens when the test fails. We can remove it at the end.
/ok to test cb5d1ba18d3431456f9468be5adf81b453823314
Closes #845
/ok to test b568c6a191f90cbf66465f69192a8eba90153f6c
/ok to test 4aa812f07e6d8586614b057024c377db6188b90d
The C++ tests failed here with a segfault.
https://github.com/rapidsai/cuvs/actions/runs/14720279064/job/41313393923?pr=805#step:9:1585
CMake Error at run_gpu_test.cmake:35 (execute_process):
execute_process failed command indexes:
1: "Abnormal exit with child return code: Segmentation fault"
@cjnolet @rhdong Any ideas, please?
/ok to test 8a00ccafde97524a864b0f502db336150bfc68ea
Edit: Ignore this comment. It was based on my misunderstanding as to where the problem originated from.
-- Build files have been written to: /__w/cuvs/cuvs/java/internal/build
[1/2] Building C object CMakeFiles/cuvs_java.dir/src/cuvs_java.c.o
[2/2] Linking C shared library libcuvs_java.so
Starting Panama FFM API bindings generation ...
/opt/conda/envs/include does not exist.
Couldn't find a suitable CUDA include directory.
RAPIDS logger » [05/01/25 15:32:22]
Trying to find includes in /opt/conda/envs/include is surprising.
The relevant parts from the script are:
CUDA_HOME=$(which nvcc | cut -d/ -f-4)
..
jextract \
--include-dir ${REPODIR}/java/internal/build/_deps/dlpack-src/include/ \
--include-dir ${CUDA_HOME}/targets/x86_64-linux/include \
--include-dir ${REPODIR}/cpp/include \
--output "${REPODIR}/java/cuvs-java/src/main/java22/" \
--target-package ${TARGET_PACKAGE} \
--header-class-name PanamaFFMAPI \
${CURDIR}/headers.h
Out of the three --include-dir params, I can't understand why any of those would point to /opt/conda/envs/include.
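One possible source, assuming nvcc resolves inside the conda env: cut -d/ -f-4 keeps only the first four slash-separated fields, which truncates a conda path:

```bash
# fields 1-4 of the conda path are "", opt, conda, envs, so the env name is lost:
echo "/opt/conda/envs/test/bin/nvcc" | cut -d/ -f-4   # prints /opt/conda/envs
# a conventional install happens to survive the same logic:
echo "/usr/local/cuda/bin/nvcc" | cut -d/ -f-4        # prints /usr/local/cuda
```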
I have a few ideas:
- Can we revert the change from --include-dir ${CUDA_HOME}/targets/x86_64-linux/include to --include-dir ${CUDA_HOME}/include and see if that works? On my local system, both work (even though cuda_runtime.h is only available in the targets/x86_64-linux dir).
- Can we add more debug printing to understand how variables like CUDA_HOME, REPODIR, etc. are resolving (sketched below)? Consequent to that, shall we see if CUDA_HOME should be computed differently (instead of trying to find where nvcc lies)?
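A minimal sketch of the debug printing suggested above (variable names are the ones already used in the script):

```bash
# Hypothetical debug printing for the variables discussed above
echo "CUDA_HOME=${CUDA_HOME}"
echo "REPODIR=${REPODIR}"
ls "${CUDA_HOME}/include" 2>/dev/null || echo "No ${CUDA_HOME}/include"
ls "${CUDA_HOME}/targets/x86_64-linux/include" 2>/dev/null || echo "No targets/x86_64-linux include dir"
```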
Oh, I misunderstood where the error is coming from. It is erroring out well before the jextract command.
Can we try the patch attached here?