Guoteng issues

Results: 9 issues by Guoteng

WIP:feature(wgt): enable ding using torch-rpc

## Description DI-engine integrates the torch.distributed.rpc module. 1. CPU-P2P-RDMA: supports RDMA CPU-P2P transmission in an IB network environment 2. GPU-P2P-RDMA: supports GPU P2P communication ## Related Issue ## TODO 1. Dynamic communication...

efficiency optimization

enhancement

feat(config): add checkpoint_fraction into config

## Motivation We have `checkpoint_fraction` but no interface for it in the config file; this PR adds one. ## Modification 1. add a `checkpoint_fraction` option to the model config. 2. add `checkpoint_fraction` sanity...

enhancement

feature(wgt): enable DI using torch-rpc to support GPU-p2p and RDMA-rpc

Commit: 1. Add torchrpc message queue. 2. Implement buffer based on CUDA-shared-tensor to optimize the data path of torchrpc. 3. Add 'bypass_eventloop' arg in Task() and Parallel(). 4. Add thread...

efficiency optimization

Module view does not show device time

Hi guys, I've recently been trying to use `torch.profile` to profile a large NLP model. However, I have encountered some problems and would like to get some advice: 1....

bug

plugin

Error: "transport retry counter exceeded" when torch.distributed.rpc.init_rpc between different pods in k8s

Hello, my code is running in a k8s environment. I started PyTorch in two pods and tried to use torchrpc, but I encountered an error in the torch.distributed.rpc.init_rpc function....

Select the ibv device that has an active port_state.

If the deviceList contains multiple ibv devices, we want to select the device of the port whose port_state is active, instead of just selecting the first device in the deviceList...

cla signed

Building Pytorch release 2.1 + Glake failed

Very cool work; we really hope to use Glake in our LLM training. However, I failed when trying to compile Glake on PyTorch release 2.1. My system information and error...

feat(simulator): support parallel cost simulator for internevo

# InternLM Simulator ## 1. Introduction The solver mainly consists of two components: 1. `profiling`: Collects the time consumption of each stage during the model training process in advance and...

【Question】Question about initial finetune loss

Hello, I recently read a [blog](https://mp.weixin.qq.com/s/J-EP6ZOeLS_lFZFD3oTNtA) about Colossal-AI supporting LoRA fine-tuning of DeepSeek-V3; it is great work for the open-source community. But I have a question about the picture in...