Abhay Shukla

8 comments by Abhay Shukla

I was using a single machine with 4 GPUs and facing the same error because each process was only seeing GPU:0. Adding `os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"` solved the problem for me.
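A minimal sketch of that fix, assuming the variable is set at the very top of the training script, before any CUDA-using library is imported (the four device IDs match the 4-GPU machine described above; the `visible` name is just for illustration):

```python
import os

# Set this before importing any library that initializes CUDA,
# otherwise the setting may be ignored by an already-created CUDA context.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"

# Sanity check: list the device IDs this process is allowed to see.
visible = os.environ["CUDA_VISIBLE_DEVICES"].split(",")
print(f"{len(visible)} GPUs exposed to this process: {visible}")
```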

I looked at the weights in the checkpoints, and they are all NaN for models saved after the loss drops to zero.
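A check like this can be sketched in plain Python. The `state` dictionary below is a hypothetical stand-in for a loaded checkpoint (parameter names mapped to flattened lists of floats), and `nan_parameter_names` is an illustrative helper, not part of any library:

```python
import math

def nan_parameter_names(state):
    """Return the names of parameters whose values are all NaN."""
    return [
        name
        for name, values in state.items()
        if values and all(math.isnan(v) for v in values)
    ]

# Hypothetical checkpoint contents, flattened to plain Python floats.
state = {
    "encoder.weight": [float("nan"), float("nan")],
    "encoder.bias": [0.1, -0.2],
}

print(nan_parameter_names(state))
```

With a real framework checkpoint, the tensors would first need to be converted to flat lists (or the check replaced by the framework's own NaN test).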

@rnyak Sure, will try it out. Do you suspect it's an exploding/vanishing gradient problem?

Training with `fp16=False` does seem to work fine.

There is no separate directory for NCCL at `/usr/local/cuda-11.2/`, but the nccl and libnccl files are present in `/usr/local/cuda-11.2/include/` and `/usr/local/cuda-11.2/lib/` respectively. Setting `NCCL_DIR=/usr/local/cuda-11.2/` did not work either.
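For reference, a small sketch of how that layout can be inspected. It assumes the header follows the usual `nccl*.h` naming and the libraries the `libnccl*` naming; `find_nccl_files` is a hypothetical helper written for this comment:

```python
from pathlib import Path

def find_nccl_files(cuda_root):
    """List NCCL headers and libraries found under a CUDA install root."""
    root = Path(cuda_root)
    include_dir = root / "include"
    lib_dir = root / "lib"
    headers = sorted(p.name for p in include_dir.glob("nccl*.h")) if include_dir.is_dir() else []
    libs = sorted(p.name for p in lib_dir.glob("libnccl*")) if lib_dir.is_dir() else []
    return headers, libs

# Path taken from the comment above; both lists are empty if it does not exist.
headers, libs = find_nccl_files("/usr/local/cuda-11.2")
print("headers:", headers)
print("libraries:", libs)
```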

Your suggestion seems to work and cmake is able to locate NCCL, but a different error appears now:

```
Defaulting to user installation because normal site-packages is not writeable
Collecting...
```

I don't see any documentation, but the `SparkXGBRanker` class is implemented at https://github.com/dmlc/xgboost/blob/master/python-package/xgboost/spark/estimator.py. Is it ready for use?

**Update**: I installed Java from https://download.oracle.com/java/20/latest/jdk-20_windows-x64_bin.exe, set the `JAVA_HOME` environment variable, and tried again. I am getting the following error now:

```
---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
Cell In[3], line 1...
```