Julio Perez comments

Results: 42 comments of Julio Perez

[BUG] merlin models on vertex ai training - cuda error

Hello @tottenjordan, what is the base driver version on the machine? Is the original picture of the nvidia-smi output from bare metal or from the actual container?

[BUG] merlin models on vertex ai training - cuda error

@tottenjordan CUDA artifacts are loaded via the /opt/nvidia/nvidia_entrypoint.sh. I built your dockerfile, and it does not change your entrypoint. This means you should be loading the correct cuda version and...

[BUG] Data parallel training freezes due to different number of batches

So this is not the correct way to use the merlin dataloader with horovod. This requires a lot more background information. You should never be creating dataloaders in a for...

[BUG] Data parallel training freezes due to different number of batches

So I just ran this unit test: `pytest tests/unit/loader/test_tf_dataloader.py::test_horovod_multigpu` and it runs as expected. There are five partitions spread across two workers, so naturally one worker will get more partitions...
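A minimal sketch of why five partitions over two workers cannot split evenly. The round-robin assignment and the `split_partitions` helper are assumptions made for illustration; the comment does not say how the Merlin dataloader actually assigns partitions.

```python
def split_partitions(num_partitions, num_workers):
    """Assign partition indices to workers round-robin (illustrative only)."""
    assignments = [[] for _ in range(num_workers)]
    for part in range(num_partitions):
        # Partition i goes to worker i mod num_workers.
        assignments[part % num_workers].append(part)
    return assignments


assignments = split_partitions(5, 2)
counts = [len(a) for a in assignments]
print(counts)  # [3, 2]
```

With an odd number of partitions and two workers, one worker always holds one partition more than the other, so the workers may also see a different number of batches.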

Getting error while connecting to merlin through docker container

@sejal9507 Can you try a more up-to-date container? It seems you're not able to load the CUDA version in the Docker container. So it tries to rely on the...

Getting error while connecting to merlin through docker container

@sejal9507 OK so based on the information you gave, I think your main issue is that your version of CUDA is too old. You need to be on CUDA 10.1...

[Task] Set up performance metrics from integration tests for Merlin example notebooks

This is blocked on the following ticket: https://github.com/NVIDIA-Merlin/Merlin/issues/343. We need to refactor the way we leverage asvdb to accommodate testbooks and non-notebook integration tests (...

[RMP] Rework the NVT example notebooks to use other Merlin libraries?

We should also decide what to migrate to other repos and what to remove altogether.

Establish a consistent memory allocation strategy for Tensorflow Memory

Rename to `allocate_tensorflow_memory` and add a keyword argument `type=dynamic | fixed | None`. If left at the default `None`, it will use the best option based on the TF version; if `fixed`, force use of `tf_memory_allocation`; if `dynamic`, try...

[RMP] Recsys Tutorial & Demo - Flesh out the multi-stage recommender example architecture

https://github.com/NVIDIA-Merlin/Merlin/pull/474