
Issues with incompatible TensorRT libraries in the Docker images google/deepvariant:latest-gpu and google/deepvariant:1.6.1-gpu

Open nlopez94 opened this issue 1 year ago • 7 comments

Hello,

I've been trying to set up the google/deepvariant:1.6.1-gpu or google/deepvariant:latest-gpu image on a GPU instance, but I get the error below when running the run_deepvariant or train scripts. The flags are generated as expected (screenshot), yet I believe the incompatible/missing TensorRT libraries are preventing these scripts from using the GPU.

Command used: sudo docker run --runtime=nvidia --gpus 1 google/deepvariant:1.6.1-gpu train --help

Error message:

2024-05-08 15:11:26.358196: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libcublas.so.12: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2024-05-08 15:11:26.358229: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.

[screenshot: train --help flags generated as expected]
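
(For context, a quick way to check whether TensorFlow inside the container can actually see the GPU is the sketch below; it assumes python3 with TensorFlow importable is available inside the image, which may vary by release:)

# Sketch: ask TensorFlow inside the GPU image which GPUs it can see.
# Assumes python3 + TensorFlow are on PATH in the container.
sudo docker run --runtime=nvidia --gpus 1 \
    --entrypoint python3 \
    google/deepvariant:1.6.1-gpu \
    -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"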

nlopez94 avatar May 08 '24 15:05 nlopez94

@nlopez94 ,

We also observe these warnings, but DeepVariant does not use any TensorRT APIs for training or inference, so these warnings are usually not actionable for the DeepVariant pipeline. Are you running inference and seeing that the machine's GPU is not being utilized?
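
(One way to check this is to watch GPU utilization from the host while the container is running; a minimal sketch, assuming nvidia-smi is installed on the host:)

# Sketch: poll GPU utilization and memory on the host while DeepVariant runs.
watch -n 2 nvidia-smi

# Or log utilization/memory every 5 seconds to a CSV for later inspection:
nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used,memory.total \
    --format=csv -l 5 > gpu_usage.csv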

kishwarshafin avatar May 08 '24 16:05 kishwarshafin

Hi @kishwarshafin,

The training process starts as expected, with GPU activity visible, but it stops abruptly without any error message during the first epoch, while determining the best checkpoint metric for the warmstart model (log excerpt below). This step completes as expected when using the CPU image with the same dataset and parameters. Initially I thought the TensorRT issues might be causing the stop, but I'll share the logs with you to get your perspective and an extra set of eyes on the problem.

Command:

( time sudo docker run --runtime=nvidia --gpus 1 \
    -v ${HOME}:${HOME} \
    -w ${HOME} \
    google/deepvariant:1.6.1-gpu \
     train \
     --config="${BASE}/dv_config.py":base \
     --config.train_dataset_pbtxt="${BASE}/training_set.pbtxt" \
     --config.tune_dataset_pbtxt="${BASE}/validation_set.pbtxt" \
     --config.init_checkpoint="${BASE}/checkpoint/deepvariant.wgs.ckpt" \
     --config.num_epochs=10 \
     --config.learning_rate=0.0001 \
     --config.num_validation_examples=0 \
     --experiment_dir=${TRAINING_DIR} \
     --strategy=mirrored \
     --config.batch_size=512 \
 ) > "${LOG_DIR}/train.log" 2>&1 &

Log output:

I0508 17:53:46.544947 140534986602304 train.py:384] Starting epoch 0
I0508 17:53:46.545100 140534986602304 train.py:391] Performing initial evaluation of warmstart model.
I0508 17:53:46.545171 140534986602304 train.py:361] Running tune at step=0 epoch=0
I0508 17:53:46.545287 140534986602304 train.py:366] Tune step 0 / 15 (0.0%)
I0508 17:54:10.069682 140512707213056 logging_writer.py:48] [0] tune/categorical_accuracy=0.22617188096046448, tune/categorical_crossentropy=1.3209192752838135, tune/f1_het=0.02283571846783161, tune/f1_homalt=0.09889934211969376, tune/f1_homref=0.843934178352356, tune/f1_macro=0.3218897581100464, tune/f1_micro=0.22617188096046448, tune/f1_weighted=0.21346084773540497, tune/false_negatives_1=6123.0, tune/false_positives_1=5727.0, tune/loss=1.3209190368652344, tune/precision_1=0.21375617384910583, tune/precision_het=0.19323670864105225, tune/precision_homalt=0.05127762258052826, tune/precision_homref=0.9494163393974304, tune/recall_1=0.20273438096046448, tune/recall_het=0.007176175247877836, tune/recall_homalt=0.834269642829895, tune/recall_homref=0.6971428394317627, tune/true_negatives_1=9633.0, tune/true_positives_1=1557.0
I0508 17:54:10.083408 140534986602304 train.py:394] Warmstart checkpoint best checkpoint metric: tune/f1_weighted=0.21346085

real    1m12.933s
user    0m0.037s
sys     0m0.013s

train.log

nlopez94 avatar May 08 '24 18:05 nlopez94

@nlopez94 can you cat validation_set.pbtxt and check how many examples you have in the tune data? It looks like everything ended normally, but there's too little data.

kishwarshafin avatar May 08 '24 18:05 kishwarshafin

@kishwarshafin As I mentioned before, the same dataset and parameters were used when I ran this on the CPU image, and in that case the process continued without ending abruptly as it does with the GPU image. Below is the content of my validation_set.pbtxt:

# Generated by shuffle_tfrecords_lowmem.py

name: "ASM3060704"
tfrecord_path: "/home/nlopez/training-case-study/customized_training/validation_set.with_label.shuffled-?????-of-?????.tfrecord.gz"
num_examples: 7762
#
# --input_pattern_list=/home/nlopez/training-case-study/customized_training/validation_set.with_label.tfrecord-?????-of-00024.gz
# --output_pattern_prefix=/home/nlopez/training-case-study/customized_training/validation_set.with_label.shuffled
#
# class1: 5628
# class0: 1774
# class2: 360

nlopez94 avatar May 08 '24 18:05 nlopez94

@nlopez94 can you remove this parameter, --config.num_validation_examples=0, and rerun, please?
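
(For reference, a sketch of the rerun, which is the same command as above with only that flag dropped:)

( time sudo docker run --runtime=nvidia --gpus 1 \
    -v ${HOME}:${HOME} \
    -w ${HOME} \
    google/deepvariant:1.6.1-gpu \
    train \
    --config="${BASE}/dv_config.py":base \
    --config.train_dataset_pbtxt="${BASE}/training_set.pbtxt" \
    --config.tune_dataset_pbtxt="${BASE}/validation_set.pbtxt" \
    --config.init_checkpoint="${BASE}/checkpoint/deepvariant.wgs.ckpt" \
    --config.num_epochs=10 \
    --config.learning_rate=0.0001 \
    --experiment_dir=${TRAINING_DIR} \
    --strategy=mirrored \
    --config.batch_size=512 \
) > "${LOG_DIR}/train.log" 2>&1 &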

kishwarshafin avatar May 08 '24 18:05 kishwarshafin

@kishwarshafin I will try this and update you on the results I get. Thank you so much for the support!

nlopez94 avatar May 08 '24 19:05 nlopez94

@kishwarshafin I just found what was causing this to exit abruptly without warning: I was running the script with insufficient memory. After changing my instance type, everything ran as expected. Thank you very much for answering my questions!
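
(For anyone else hitting a silent exit like this, one way to confirm an out-of-memory kill is to check the host's kernel log; a minimal sketch, assuming a Linux host with dmesg/journalctl available:)

# Sketch: look for OOM-killer events on the host after a silent exit.
sudo dmesg -T | grep -i -E 'out of memory|oom-kill|killed process'

# On systemd hosts, the kernel journal shows the same events:
sudo journalctl -k | grep -i -E 'out of memory|oom'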

nlopez94 avatar May 08 '24 19:05 nlopez94

Hi, in fact you can find the paths of the missing libraries above and add them to LD_LIBRARY_PATH, and the warning will be eliminated. I had the same problem and solved it this way.
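
(A minimal sketch of that workaround, assuming the CUDA 12 / TensorRT libraries are actually present somewhere in the container or on a mounted host path; the directories below are placeholders and will differ per setup. As noted above, the warning is harmless for DeepVariant itself, so this is only cosmetic:)

# Sketch: locate the libraries mentioned in the warning ...
find / -name 'libcublas.so.12*' 2>/dev/null
find / -name 'libnvinfer_plugin.so*' 2>/dev/null

# ... then prepend their directories (placeholders here) to LD_LIBRARY_PATH
# before running DeepVariant:
export LD_LIBRARY_PATH="/path/to/cuda12/lib64:/path/to/tensorrt/lib:${LD_LIBRARY_PATH}"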

kunmonster avatar May 11 '24 12:05 kunmonster

Thanks for confirming @nlopez94, I will close the issue.

kishwarshafin avatar May 11 '24 16:05 kishwarshafin