Naoto Usuyama

Results 44 comments of Naoto Usuyama

I also noticed NCCL_IB_DISABLE (env variable) is set to 1 by the pretrain AML environment (or maybe by the Docker image)
```
NCCL_IB_DISABLE The NCCL_IB_DISABLE variable disables the IB/RoCE transport...
```

When I tried the pretraining on ND24rs (RDMA/InfiniBand), I got the following error:
```
RuntimeError: NCCL error in: ... /torch/lib/c10d/ProcessGroupNCCL.cpp:290, unhandled system error
```
I think NCCL_IB_DISABLE should be set...
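A minimal sketch for checking how NCCL_IB_DISABLE is currently set inside a job, and for overriding it explicitly. The helper names `ib_transport_disabled` and `set_ib_transport` are hypothetical and not part of the repo. The sketch assumes the variable has to be in the environment before the NCCL process group is initialized, and it does not decide which value is right for a given cluster.

```python
import os


def ib_transport_disabled(env=None):
    """Return True when NCCL_IB_DISABLE is set to a non-zero value."""
    env = os.environ if env is None else env
    value = env.get("NCCL_IB_DISABLE", "0").strip()
    return value not in ("", "0")


def set_ib_transport(enabled, env=None):
    """Write NCCL_IB_DISABLE so the IB/RoCE transport is enabled or disabled.

    Assumption: this runs before the NCCL process group is created.
    """
    env = os.environ if env is None else env
    env["NCCL_IB_DISABLE"] = "0" if enabled else "1"


if __name__ == "__main__":
    # Report what the environment (AML or Docker image) has set.
    print("NCCL_IB_DISABLE =", os.environ.get("NCCL_IB_DISABLE", "<unset>"))
    print("IB/RoCE transport disabled:", ib_transport_disabled())
```

Printing the value at the top of the training script makes it easier to tell whether the AML environment or the Docker image is the one setting it.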

After checking with AzureML folks, it turned out I have to use Intel MPI as the backend when I use nodes without SR-IOV support. > SR-IOV stands for “single root...

From the name, this (https://github.com/Azure/AzureML-Containers/blob/master/base/gpu/openmpi3.1.2-cuda10.0-cudnn7-ubuntu16.04/Dockerfile) looks like the base image; however, it seems there are additional steps for the AzureML-BERT image. Still looking for the Dockerfile.

Duplicate of #29 (still open)

For now, I created a `bert_data/validation_512_only` folder and moved `wikipedia_segmented_part_98.bin` into it, and the training pipeline seems to be working fine. It would still be great to use the updated files, @jingyanwangms
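The workaround above can be sketched in a few lines of Python. The function name `move_to_validation` is hypothetical, and the location of the source file is an assumption, since the comment does not say where `wikipedia_segmented_part_98.bin` was moved from; pass the real path explicitly.

```python
import shutil
from pathlib import Path


def move_to_validation(source_file, data_dir="bert_data",
                       subdir="validation_512_only"):
    """Create data_dir/subdir if needed and move source_file into it."""
    source_file = Path(source_file)
    target_dir = Path(data_dir) / subdir
    target_dir.mkdir(parents=True, exist_ok=True)
    destination = target_dir / source_file.name
    shutil.move(str(source_file), str(destination))
    return destination


if __name__ == "__main__":
    # Assumed source location; adjust to wherever the file actually lives.
    candidate = Path("bert_data") / "wikipedia_segmented_part_98.bin"
    if candidate.exists():
        print("moved to", move_to_validation(candidate))
    else:
        print("nothing to move:", candidate, "not found")
```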

U-Net with ImageNet-pretrained backbones should work well for natural scene images.

Thanks for the great suggestion! Please feel free to submit a PR.

Hi @prarobinson, thanks for trying this repo. Which version of PyTorch are you using? I created this repo with PyTorch 0.4.1, so there might be some breaking changes with new...
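A quick way to answer the version question is to print `torch.__version__` and compare it against the 0.4.1 baseline mentioned above. The sketch below is only illustrative: `parse_version` and `is_newer_than` are hypothetical helpers, and the comparison looks at the numeric major, minor, and patch parts only.

```python
import re


def parse_version(version):
    """Extract (major, minor, patch) from strings like '0.4.1' or '1.13.1+cu117'."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    # A missing patch component is treated as 0.
    return tuple(int(part) if part else 0 for part in match.groups())


def is_newer_than(version, baseline="0.4.1"):
    """Return True when version is strictly newer than baseline."""
    return parse_version(version) > parse_version(baseline)


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed")
    else:
        print("PyTorch", torch.__version__,
              "newer than 0.4.1:", is_newer_than(torch.__version__))
```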

Hmm, that's interesting. Can you try to reproduce it in an online notebook and share it? e.g. https://colab.research.google.com/