Guoteng

Results 13 comments of Guoteng

Hi, thank you for your interest in our project. As you mentioned, a global batch of 16M is indeed quite large. We commonly use configurations such as 512 GPUs with...

By the way, because my trace.json is very large, TensorBoard opens it very slowly and sometimes even runs out of memory. Is there a way to output the profiling data as a raw string?...
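One way to inspect such a file without TensorBoard is to aggregate it directly into plain text. The sketch below assumes trace.json follows the Chrome trace event format (a `traceEvents` list whose complete events have `ph == "X"` and a `dur` in microseconds), which is what `export_chrome_trace` writes. The helper name `summarize_trace` is hypothetical, and `json.load` still reads the whole file into memory, so this only helps when the viewer is the bottleneck rather than the file size itself.

```python
import json
from collections import defaultdict


def summarize_trace(path, top=10):
    """Return a plain-text table of the events with the largest total duration."""
    with open(path) as f:
        data = json.load(f)
    # The format allows either an object with "traceEvents" or a bare event list.
    events = data.get("traceEvents", []) if isinstance(data, dict) else data

    total_us = defaultdict(float)
    calls = defaultdict(int)
    for ev in events:
        # "X" marks a complete event, which carries a duration in microseconds.
        if ev.get("ph") == "X" and "dur" in ev:
            total_us[ev["name"]] += ev["dur"]
            calls[ev["name"]] += 1

    # Sort by accumulated duration, largest first, and keep the top rows.
    rows = sorted(total_us.items(), key=lambda kv: kv[1], reverse=True)[:top]
    lines = [f"{'name':<40} {'calls':>8} {'total_us':>12}"]
    for name, dur in rows:
        lines.append(f"{name[:40]:<40} {calls[name]:>8} {dur:>12.1f}")
    return "\n".join(lines)
```

Calling `print(summarize_trace("trace.json"))` prints the table to the terminal. If the profiler object is still available in the script, `prof.key_averages().table(sort_by="cpu_time_total", row_limit=10)` gives a similar text summary without going through the JSON file.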

2/27 UPDATE: I copied the resnet18 profiling example from the [PyTorch tutorial](https://pytorch.org/tutorials/intermediate/tensorboard_profiler_tutorial.html):

```python
import torch
import torch.nn
import torch.optim
import torch.profiler
import torch.utils.data
import torchvision.datasets
import torchvision.models
import torchvision.transforms as T

transform...
```

Hello @guotuofeng , Is there any progress on this bug?

I suspect that the MLNX_OFED version may be too old, and I am looking into this.

After troubleshooting, I finally determined the cause of this problem. On the k8s training cluster in our lab, each compute node is equipped with two IB network cards, but only...

Hi Luca @lw great to hear your reply. > and would be interested in knowing more what you're doing with it). The main scenario where we use tensorpipe is reinforcement...

Looking forward to supporting torch2.1!

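Hello @rourouZ, it looks like the error stack printed by torch_npu does not contain much useful information. The environment we use for our Huawei NPU adaptation is:

```
torch: 2.1.0+cpu
torch_npu: 2.1.0.post3+git7c4136d
cann: 8.0.RC1.alpha003
```

You could try running with this environment; it works in my tests. If you have any questions, you can also @ me in the InternLM discussion group.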