shijie liu

Results: 15 comments by shijie liu

Hi @iidsample, thanks for trying out HugeCTR! About the [multinode-training tutorial](https://github.com/NVIDIA-Merlin/HugeCTR/tree/master/tutorial/multinode-training): unfortunately, it is currently out of date and will be removed in the next release. For now, we provide docker image...

The key idea for launching multi-node training in HugeCTR is to use MPI, as https://github.com/NVIDIA-Merlin/HugeCTR/blob/master/tutorial/multinode-training/run_multinode.sh#L110 suggests. So the steps can be: 1. install and configure MPI in a bunch of...
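To make the MPI launch idea concrete, here is a minimal sketch that only assembles an Open MPI style `mpirun` command line. It is not taken from the tutorial script: the host names, the `train.py` script name, and the helper `build_mpirun_command` are hypothetical, and one process per node is an assumption you may need to change for your setup.

```python
import shlex


def build_mpirun_command(hosts, procs_per_node, train_cmd):
    """Assemble an Open MPI style mpirun invocation as an argument list.

    hosts, procs_per_node and train_cmd are illustrative inputs,
    not values from the HugeCTR tutorial.
    """
    if not hosts:
        raise ValueError("at least one host is required")
    total_procs = len(hosts) * procs_per_node
    # Open MPI accepts "-H host:slots,host:slots" to list the nodes inline.
    host_spec = ",".join(f"{host}:{procs_per_node}" for host in hosts)
    return ["mpirun", "-np", str(total_procs), "-H", host_spec] + shlex.split(train_cmd)


# Hypothetical two-node launch, one process per node (an assumption).
cmd = build_mpirun_command(["node0", "node1"], 1, "python3 train.py")
print(shlex.join(cmd))
```

Building the command as a list keeps quoting explicit, so it can be passed to `subprocess.run` unchanged once MPI and passwordless SSH between the nodes are configured.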

Hi @iidsample, could you provide more detailed logs and your scripts? Thanks!

Hi @raghavendrachari08 ``` [1695144496.484294] [hpci5201:103648:0] ib_device.c:1250 UCX ERROR ibv_create_ah(dlid=49152 sl=0 port=1 src_path_bits=0 dgid=::ffff:192.160.0.55 sgid_index=3 traffic_class=106) for UD verbs connect on bnxt_re0 failed: Connection timed out [hpci5201:103648] pml_ucx.c:419 Error: ucp_ep_create(proc=1) failed:...

> The broadcast layer requires that the first dimension must be the same, but in your project is 2048 and 2018, I don't know if the Tile Layer can solve...

Hi @ramgandikota, you can use Nsight Systems to get profiling results. Please refer to the [Nsight Systems documentation](https://docs.nvidia.com/nsight-systems/) for more detailed usage guidance. Here is an example: ``` nsys profile -s...

Hi @mia1460, thanks for your interest in our roadmap! We do not plan to support FP32. Instead, we are focusing on lower precisions such as FP8 in the roadmap. For...

[CI](https://gitlab-master.nvidia.com/Devtech-Compute/distributed-recommender/-/pipelines/32517943)

@jiashuy @z52527 Before we merge this PR, we need to add documentation and update the example.

> terminate called after throwing an instance of 'std::bad_alloc' It seems like there is a memory allocation error. Could you try decreasing the batch_size or seqlen?
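As a rough illustration of why lowering `batch_size` or `seqlen` helps, the sketch below estimates the size of a single dense buffer of shape `(batch_size, seqlen, hidden_dim)`. The function `dense_buffer_bytes`, the shape, and the 2-byte element size are assumptions for illustration only; the real allocations in the model may scale differently.

```python
def dense_buffer_bytes(batch_size, seqlen, hidden_dim, bytes_per_elem=2):
    """Bytes needed for one dense (batch_size, seqlen, hidden_dim) buffer.

    All arguments are hypothetical; bytes_per_elem=2 assumes a 16-bit dtype.
    """
    return batch_size * seqlen * hidden_dim * bytes_per_elem


# Illustrative numbers only: halving batch_size halves this buffer.
full = dense_buffer_bytes(batch_size=32, seqlen=1024, hidden_dim=512)
half = dense_buffer_bytes(batch_size=16, seqlen=1024, hidden_dim=512)
print(full, half)
```

Under this linear assumption, halving either `batch_size` or `seqlen` halves the buffer, so a quick way to confirm an out-of-memory cause is to rerun with one of them halved.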