Jia Wei

Results: 9 issues by Jia Wei

Hi, guys. I tried to combine DALI with the `torch.autograd.graph.saved_tensors_hooks(pack_hook, unpack_hook)` API to speed up offloading and prefetching of intermediate feature maps to SSDs (see the hook sketch below). I converted the PyTorch tensor to...

help wanted
GDS
perf
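
For reference, here is a minimal sketch of the hook pattern that issue describes, assuming a plain CPU offload in place of the SSD/GDS path; the model, tensor sizes, and device handling are placeholders, not the issue author's code:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

def pack_hook(tensor):
    # Called when autograd saves an activation for backward:
    # move it off the compute device (an SSD/GDS path would write it to storage here).
    return tensor.detach().cpu()

def unpack_hook(packed):
    # Called when backward needs the activation again: bring it back.
    return packed.to(device)

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024),
    torch.nn.ReLU(),
    torch.nn.Linear(1024, 1024),
).to(device)
x = torch.randn(64, 1024, device=device, requires_grad=True)

with torch.autograd.graph.saved_tensors_hooks(pack_hook, unpack_hook):
    loss = model(x).sum()
loss.backward()  # activations are restored through unpack_hook
```

An SSD-backed variant would keep the same structure and only swap the bodies of `pack_hook`/`unpack_hook` for writes and reads against fast storage.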

My Code:

```python
"""run.py:"""
#!/usr/bin/env python
import os
import sys
import torch
import torch.distributed as dist
import time
from torch.multiprocessing import Process

# """Blocking point-to-point communication."""
# def run(rank, size):...
```
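
The truncated script resembles the standard blocking point-to-point layout from the torch.distributed tutorials; a self-contained sketch of that pattern (my reconstruction, not the author's full file) looks roughly like this, using the "gloo" backend so it runs on CPU:

```python
import os
import torch
import torch.distributed as dist
from torch.multiprocessing import Process

def run(rank, size):
    """Blocking point-to-point communication."""
    tensor = torch.zeros(1)
    if rank == 0:
        tensor += 1
        dist.send(tensor=tensor, dst=1)   # blocks until rank 1 has received
    else:
        dist.recv(tensor=tensor, src=0)   # blocks until the data arrives
    print(f"Rank {rank} has data {tensor[0]}")

def init_process(rank, size, fn, backend="gloo"):
    # Minimal single-node rendezvous via environment variables.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group(backend, rank=rank, world_size=size)
    fn(rank, size)

if __name__ == "__main__":
    size = 2
    processes = []
    for rank in range(size):
        p = Process(target=init_process, args=(rank, size, run))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()
```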

**Describe the bug** If the user has installed the Chinese version of Visual Studio, an error may occur when running the "Getting Started" program. In the file "getting_started.ipynb", when...

**Describe the bug** I replaced `[12] tester = np.random.rand(2000, 4000)` with ``` import torch tester = torch.rand(2000,4000) tester ``` and `%timeit csdfg(A=tester, N=np.int32(2000))` with `%timeit csdfg(A=tester,N=2000)`, which means using torch.tensor()...

For well-known reasons, we don't have direct access to huggingface.co, so we usually use `export HF_ENDPOINT="https://hf-mirror.com"` to get around this. But this method fails when we execute the...
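
As a sketch of the usual workaround, the mirror endpoint can also be set from Python before `huggingface_hub` is imported, since the endpoint is typically read at import time; the repo id below is just a placeholder:

```python
import os

# Must be set before huggingface_hub is imported, otherwise the default
# endpoint has already been picked up.
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"

from huggingface_hub import snapshot_download

# Placeholder repo id; any download now goes through the mirror endpoint.
snapshot_download(repo_id="bert-base-uncased")
```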

Recently, I compared the txt2img examples baseline.py and mii.py, and a surprising result occurred: the baseline is even faster than MII. The baseline inference result is: `(wjtorch2.0.1) lthpc@lthpc-C01:~/nvmessd/wj/DeepSpeed-MII/mii/legacy/examples/benchmark/txt2img$ python baseline-sd.py...

### Describe the question.
**I tried to compare DALI and PyTorch preprocessing speed using the following code** (see also the timing sketch after the label below):
```python
import torchvision.transforms as transforms
import torchvision.datasets as datasets
import torch
from time...
```

question
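
For the PyTorch side of such a comparison, a minimal timing sketch might look like the following; the dataset path, transform chain, and loader settings are placeholders, and the DALI pipeline would be timed over the same images for a fair comparison:

```python
import time
import torch
import torchvision.transforms as transforms
import torchvision.datasets as datasets

# Typical CPU preprocessing chain used as the torchvision baseline.
transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

dataset = datasets.ImageFolder("/path/to/images", transform=transform)  # placeholder path
loader = torch.utils.data.DataLoader(dataset, batch_size=256, num_workers=8)

start = time.time()
for images, _ in loader:
    pass  # decoding + augmentation happen inside the DataLoader workers
print(f"torchvision preprocessing took {time.time() - start:.2f}s")
```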

### System Info
When I use verl (vLLM + FSDP + vLLM Ascend) to train the Qwen-30B model, an error occurs: `File "/usr/local/python3.11.13/lib/python3.11/site-packages/transformers/models/qwen3_moe/modeling_qwen3_moe.py", line 363, in forward hidden_states = residual + hidden_states ~~~~~~~~~^~~~~~~~~~~~~~~ RuntimeError: The...

bug

I see that NVLink is a requirement for this repository, but I wonder whether DeepEP can be used on a purely PCIe 8x H800 server for single-node training? It cannot...