[CI] Enable Java test in CI workflow

Open rhdong opened this issue 10 months ago • 33 comments

This PR adds changes for Java CI.

Some scripts modified here also appear in PR #831. Once 831 is merged, I’ll rebase and make sure everything stays consistent.

rhdong avatar Apr 03 '25 22:04 rhdong

@rhdong could you please put this PR into draft until you're ready for reviews? That'd reduce the notifications reviewers are getting, and help them understand when it's time to come review.

jameslamb avatar Apr 04 '25 14:04 jameslamb

Thanks for the reminder! I’ve marked the PR as draft now.

rhdong avatar Apr 04 '25 14:04 rhdong

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

copy-pr-bot[bot] avatar Apr 04 '25 15:04 copy-pr-bot[bot]

/ok to test

rhdong avatar Apr 04 '25 20:04 rhdong

/ok to test

rhdong avatar Apr 07 '25 21:04 rhdong

/ok to test

rhdong avatar Apr 07 '25 21:04 rhdong

/ok to test

rhdong avatar Apr 08 '25 18:04 rhdong

/ok to test

rhdong avatar Apr 08 '25 21:04 rhdong

/ok to test

rhdong avatar Apr 08 '25 21:04 rhdong

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

copy-pr-bot[bot] avatar Apr 22 '25 22:04 copy-pr-bot[bot]

/ok to test

rhdong avatar Apr 22 '25 22:04 rhdong

/ok to test

@rhdong, there was an error processing your request: E1

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/

copy-pr-bot[bot] avatar Apr 22 '25 22:04 copy-pr-bot[bot]

/ok to test 79a477ece8e32f75de95071cc8b1403e7eafbbfc

rhdong avatar Apr 22 '25 22:04 rhdong

The reason for the Java build failure is that the script could not find jextract (needed for generating Panama bindings before Java build and tests). The CI servers need to have jextract preinstalled from https://jdk.java.net/jextract

@cjnolet @rhdong

Screenshot from 2025-04-24 12-59-56

narangvivek10 avatar Apr 24 '25 07:04 narangvivek10

@narangvivek10 @rhdong @cjnolet I've committed a fix [0] to download jextract automatically if it is not already installed. The reason for doing this is that jextract doesn't have a .deb or apt package for Ubuntu, so the download of jextract needs to be scripted anyway.

[0] - https://github.com/rapidsai/cuvs/pull/831/commits/570fa2a7a792b39cb70c4ff1232661481ba8ecaa in https://github.com/rapidsai/cuvs/pull/831
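
A minimal sketch of the "download only if missing" idea, not the code from the linked commit. `JEXTRACT_URL` and `JEXTRACT_DIR` are hypothetical variables introduced for illustration; the caller is assumed to supply the tarball URL for its platform.

```shell
#!/usr/bin/env bash
# Sketch only: JEXTRACT_URL and JEXTRACT_DIR are hypothetical names,
# not variables taken from the cuVS scripts.

# Print the path of a usable jextract binary, downloading one only if needed.
ensure_jextract() {
  local install_dir="${JEXTRACT_DIR:-${PWD}/jextract-22}"

  # 1. Prefer a jextract that is already on PATH.
  if command -v jextract >/dev/null 2>&1; then
    command -v jextract
    return 0
  fi

  # 2. Reuse an earlier download if one is present.
  if [ -x "${install_dir}/bin/jextract" ]; then
    echo "${install_dir}/bin/jextract"
    return 0
  fi

  # 3. Otherwise fetch and unpack the tarball named by the caller.
  if [ -z "${JEXTRACT_URL:-}" ]; then
    echo "jextract not found and JEXTRACT_URL is not set" >&2
    return 1
  fi
  mkdir -p "${install_dir}"
  curl -fsSL "${JEXTRACT_URL}" \
    | tar -xz -C "${install_dir}" --strip-components=1 || return 1
  echo "${install_dir}/bin/jextract"
}
```

A build script could then call `JEXTRACT_BIN=$(ensure_jextract) || exit 1` and invoke `"${JEXTRACT_BIN}"` instead of relying on `PATH`.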

chatman avatar Apr 24 '25 08:04 chatman

/ok to test 04f0bba69f7bcc7968c2f62a9d04520b083dedc9

rhdong avatar Apr 24 '25 15:04 rhdong

@narangvivek10 The jextract process failed due to:

c_api.h:19:10: error: 'cuda_runtime.h' file not found
fatal: Unexpected exception org.openjdk.jextract.clang.TypeLayoutError: Invalid. segment: org.openjdk.jextract.clang.Type@cc813a2e, fieldName: n_probes occurred
Jextract encountered issues (returned value 5)
Bindings generation did not complete normally (returned value 5)
Forcing this build process to abort

Any ideas where cuda_runtime.h can be found?

chatman avatar Apr 24 '25 17:04 chatman

@rhdong We have attempted to find the CUDA_HOME dir as: CUDA_HOME=$(which nvcc | cut -d/ -f-4)

We then tried to add the $CUDA_HOME/include dir to the include paths. Any idea whether this was the problem, and is there a better way?
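
To make the behaviour of `cut -d/ -f-4` concrete, here is a small sketch that applies it to two example nvcc paths. The leading empty field before the first `/` counts as field 1, so only three directory levels survive. The `dirname`-based alternative is an assumption offered for comparison, not the project's fix, and the deeper example path assumes nvcc sits under a conda environment prefix.

```shell
#!/usr/bin/env bash
# Derive a CUDA root from an nvcc path in two ways. Both helpers take the
# path as an argument, so nothing here needs CUDA to be installed.

# The approach quoted above: keep the first four '/'-separated fields.
# The empty field before the leading '/' is field 1.
cuda_home_by_cut() {
  echo "$1" | cut -d/ -f-4
}

# Alternative (an assumption, not the project's fix): strip "/bin/nvcc"
# no matter how deep the install prefix is.
cuda_home_by_dirname() {
  dirname "$(dirname "$1")"
}

cuda_home_by_cut /usr/local/cuda/bin/nvcc            # /usr/local/cuda
cuda_home_by_cut /opt/conda/envs/test/bin/nvcc       # /opt/conda/envs
cuda_home_by_dirname /opt/conda/envs/test/bin/nvcc   # /opt/conda/envs/test
```

The `cut` form only works when nvcc is exactly three directories below `/`; with a deeper prefix it silently truncates the path.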

chatman avatar Apr 24 '25 18:04 chatman

Also, I see the following:

2025-04-24T16:03:41.9414077Z Forcing this build process to abort
2025-04-24T16:03:41.9513171Z 
2025-04-24T16:03:41.9516434Z RAPIDS logger » [04/24/25 16:03:41]
2025-04-24T16:03:41.9517878Z ┌─────────────────────────────────────────────────────────────────────────────┐
2025-04-24T16:03:41.9519658Z |    Initial Java build & test failed. Retrying with 'mvn clean verify -X'    |
2025-04-24T16:03:41.9521203Z └─────────────────────────────────────────────────────────────────────────────┘
2025-04-24T16:03:41.9522013Z 

I think this retrying is not necessary, and not correct either, since here the failure was in a step even before Maven is invoked (failure is in the generate-bindings.sh file). Due to this retry, the logs are polluted with a lot of symbol not found issues via Maven, and it masks the original problem that the Panama bindings were not properly generated.
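
One way to keep a bindings failure from being buried under a Maven retry is to run the steps in order and stop at the first one that fails. The following is a hedged sketch: `run_steps` is a hypothetical helper, and the step commands in the usage comment are placeholders rather than the actual CI invocation.

```shell
#!/usr/bin/env bash
# Sketch only: run_steps is a hypothetical helper, not part of the cuVS CI scripts.

# Run each argument as a shell command, in order. Stop at the first failure
# so later steps (for example a Maven retry) never run after a broken step.
run_steps() {
  local step
  for step in "$@"; do
    echo "==> ${step}"
    if ! bash -c "${step}"; then
      echo "FAILED: ${step}" >&2
      return 1
    fi
  done
}

# Example (placeholder commands):
#   run_steps "./generate-bindings.sh" "mvn clean verify" || exit 1
```

With this shape, the last lines of the log always belong to the step that actually failed.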

chatman avatar Apr 24 '25 18:04 chatman

@rhdong I've made the following changes:

  • Debug printing of the CUDA_HOME variable, and the contents of the $CUDA_HOME/include
  • If there's no include dir inside CUDA_HOME, fall back to setting CUDA_HOME to /usr/local/cuda

https://github.com/rapidsai/cuvs/pull/831/files/570fa2a7a792b39cb70c4ff1232661481ba8ecaa..306229d29b0123bc7f6e72adca6e7d155047f528

I'm hoping it will make things work. Can you please merge that and retest here?
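
The fallback described above can be sketched as a search over candidate directories, picking the first one that actually contains cuda_runtime.h. This is an illustration under stated assumptions, not the contents of the linked diff: `find_cuda_include` is a hypothetical helper, and the candidate list in the usage comment is only an example.

```shell
#!/usr/bin/env bash
# Sketch only: find_cuda_include is a hypothetical helper.

# Print the first candidate directory that contains cuda_runtime.h.
# Each candidate is logged to stderr so the CI log shows what was tried.
find_cuda_include() {
  local dir
  for dir in "$@"; do
    echo "checking ${dir}" >&2
    if [ -f "${dir}/cuda_runtime.h" ]; then
      echo "${dir}"
      return 0
    fi
  done
  echo "Couldn't find a suitable CUDA include directory." >&2
  return 1
}

# Example (candidate paths are illustrative):
#   CUDA_INCLUDE=$(find_cuda_include \
#     "${CUDA_HOME}/include" \
#     "${CUDA_HOME}/targets/x86_64-linux/include" \
#     /usr/local/cuda/include) || exit 1
```

Testing for the header itself, rather than for the directory, avoids accepting an include dir that exists but does not ship the CUDA runtime headers.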

chatman avatar Apr 24 '25 18:04 chatman

Hi @chatman @narangvivek10, the Docker image is rapidsai/ci-conda:latest. The CUDA includes are installed when the conda env named test is created; in my local experiment, cuda_runtime.h is at /opt/conda/envs/test/targets/x86_64-linux/include/cuda_runtime.h. I fixed this in the top commit, and now a new error comes up:

jextract-22/bin/jextract.ps1
jextract downloaded to /cuvs/java/jextract-22
common.h:21:10: error: 'dlpack/dlpack.h' file not found
fatal: Unexpected exception org.openjdk.jextract.clang.TypeLayoutError: Invalid. segment: org.openjdk.jextract.clang.Type@1c99c732, fieldName: addr occurred
Jextract encountered issues (returned value 5)
Bindings generation did not complete normally (returned value 5)
Forcing this build process to abort

RAPIDS logger » [04/24/25 20:26:13]
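
Both failures so far (cuda_runtime.h, then dlpack/dlpack.h) are headers missing from the include path. A hedged sketch of a pre-flight check that reports this before jextract runs; `require_header` and the variable names in the usage comment are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: require_header is a hypothetical helper, not part of the cuVS scripts.

# Succeed if header $1 (a path relative to an include dir) exists under
# any of the remaining arguments; otherwise report what was searched.
require_header() {
  local header="$1"
  shift
  local dir
  for dir in "$@"; do
    if [ -f "${dir}/${header}" ]; then
      return 0
    fi
  done
  echo "header '${header}' not found in: $*" >&2
  return 1
}

# Example (variable names are placeholders):
#   require_header dlpack/dlpack.h "${DLPACK_INCLUDE}" "${CUDA_INCLUDE}" || exit 1
#   require_header cuda_runtime.h  "${DLPACK_INCLUDE}" "${CUDA_INCLUDE}" || exit 1
```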

rhdong avatar Apr 24 '25 20:04 rhdong

Yeah, I agree about the retry. Its main goal was to debug the HNSW issue (which has been resolved), and it only happens when the tests fail. We can remove it at the end.

rhdong avatar Apr 24 '25 20:04 rhdong

/ok to test cb5d1ba18d3431456f9468be5adf81b453823314

rhdong avatar Apr 24 '25 20:04 rhdong

Closes #845

cjnolet avatar Apr 25 '25 16:04 cjnolet

/ok to test b568c6a191f90cbf66465f69192a8eba90153f6c

rhdong avatar Apr 28 '25 18:04 rhdong

/ok to test 4aa812f07e6d8586614b057024c377db6188b90d

rhdong avatar Apr 29 '25 00:04 rhdong

The C++ tests failed here with a segfault.

https://github.com/rapidsai/cuvs/actions/runs/14720279064/job/41313393923?pr=805#step:9:1585

 CMake Error at run_gpu_test.cmake:35 (execute_process):
  execute_process failed command indexes:

    1: "Abnormal exit with child return code: Segmentation fault"

@cjnolet @rhdong Any ideas, please?

chatman avatar Apr 29 '25 03:04 chatman

/ok to test 8a00ccafde97524a864b0f502db336150bfc68ea

rhdong avatar May 01 '25 02:05 rhdong

Edit: Ignore this comment. It was based on my misunderstanding as to where the problem originated from.

-- Build files have been written to: /__w/cuvs/cuvs/java/internal/build
[1/2] Building C object CMakeFiles/cuvs_java.dir/src/cuvs_java.c.o
[2/2] Linking C shared library libcuvs_java.so
Starting Panama FFM API bindings generation ...
/opt/conda/envs/include does not exist.
Couldn't find a suitable CUDA include directory.

RAPIDS logger » [05/01/25 15:32:22]

Trying to find includes in /opt/conda/envs/include is surprising. The relevant parts from the script are:

CUDA_HOME=$(which nvcc | cut -d/ -f-4)

..

jextract \
 --include-dir ${REPODIR}/java/internal/build/_deps/dlpack-src/include/ \
 --include-dir ${CUDA_HOME}/targets/x86_64-linux/include \
 --include-dir ${REPODIR}/cpp/include \
 --output "${REPODIR}/java/cuvs-java/src/main/java22/" \
 --target-package ${TARGET_PACKAGE} \
 --header-class-name PanamaFFMAPI \
 ${CURDIR}/headers.h

Out of the three --include-dir params, I can't understand why any of those would point to /opt/conda/envs/include.

I have a few ideas:

  1. Can we revert the change from --include-dir ${CUDA_HOME}/targets/x86_64-linux/include to --include-dir ${CUDA_HOME}/include and see if that works? On my local system, both of these work (even though cuda_runtime.h is only available in the targets/x86_64-linux dir).

  2. Can we add more debug printing to understand how variables like CUDA_HOME, REPODIR, etc. are resolving? Following that, shall we see if CUDA_HOME should be computed differently (instead of trying to find where nvcc lies)?
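
The debug printing suggested in idea 2 could look something like the sketch below. `debug_vars` is a hypothetical helper; it relies on bash indirect expansion (`${!name}`), so it assumes the script runs under bash rather than a plain POSIX sh.

```shell
#!/usr/bin/env bash
# Sketch only: debug_vars is a hypothetical helper (bash-specific).

# For each variable name given, print its value and whether that value
# is an existing directory.
debug_vars() {
  local name value
  for name in "$@"; do
    value="${!name:-}"
    if [ -z "${value}" ]; then
      echo "${name} is unset or empty"
    elif [ -d "${value}" ]; then
      echo "${name}=${value} (directory exists)"
    else
      echo "${name}=${value} (not a directory)"
    fi
  done
}

# Example:
#   debug_vars CUDA_HOME REPODIR CURDIR
```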

chatman avatar May 02 '25 14:05 chatman

fix-cudainclude.txt

Oh, I misunderstood where the error is coming from. It errors out well before the jextract command.

Can we try the patch attached here?

chatman avatar May 02 '25 15:05 chatman