Peng

Results 12 issues of Peng

### Describe the feature Decouple ZeRO and parallelism for the dense layers and expert layers in MoE models ### Will you implement it? - [ ] I would like to implement this feature and create a PR!

enhancement

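### Describe the feature In practice, memory_pool is not needed. The memory pool logic may conflict with the device memory allocation strategies of other chips. We suggest removing the memory pool implementation and all of its uses, including MoE's use of the memory pool. ### Will you implement it? - [ ] I would like to implement this feature and create a PR!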

enhancement

### Describe the feature A very simple on-the-fly dataloader is needed to support most public datasets ### Will you implement it? - [X] I would like to implement this feature...

enhancement

### Describe the feature Should we remove the other dependencies of flash-attention and keep only the core attention-related ops? If possible, we could install flash-attention with pip alone, avoiding...

enhancement

### Describe the feature CI should have a true flash-attention-free environment ### Will you implement it? - [X] I would like to implement this feature and create a PR!

bug
enhancement

### Describe the bug ### Environment Torch 2.1 ### Other information _No response_

bug

### Describe the bug We have many cases like the following: ` data = torch.empty(partition_size, dtype=tensor.dtype, device=torch.cuda.current_device(), requires_grad=False) ` where we directly use device=torch.cuda.current_device(). However, it is not recommended...

bug

### Describe the feature Update the README with the new dependency versions. ### Will you implement it? - [ ] I would like to implement this feature and create a PR!

enhancement

### Describe the feature Support Hugging Face modeling Python files ### Will you implement it? - [ ] I would like to implement this feature and create a PR!

enhancement