Qingquan Song
## 🐛 Bug version 1.3.1 1) Similar issue as: #531 when running the following code: ```py class MyModel(LightningModule): def __init__(self): self.metrics: ModuleDict[str, MetricCollection] = ModuleDict(...
Are we still able to get "thrift_library.bzl", "python_library.bzl", "python_unittest.bzl" as well as other bzl files? Thanks!
Hey Team, We're running some experiments with Mistral 7B ORPO and variants, but found that using GPT-4-1106-preview as the baseline + OpenAI GPT-4 judgement produces overly high results: ``` INFO:root:Not saving...
Hey team, AO provides awesome FP8 support with torch compile to get speed and memory improvements. However, since torch compile is not always easily applicable for some models such as...
Hey Team, I'm trying to use FSDP1/2 with Float8InferenceLinear but it seems to have some issues (with torch 2.3.1+cu118). Do you suggest bumping to a higher version of torch and have a...
### 🚀 The feature, motivation and pitch TVD is a good distance metric ([ref](https://aclanthology.org/2023.acl-long.605.pdf)) with an easy-to-implement kernel, and it makes the gradient more stable compared to KL divergence and...
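For reference, a minimal unfused PyTorch sketch of the TVD loss described above, assuming both inputs are logits (the function name and signature are illustrative, not an existing API); a fused kernel would compute the same quantity:

```py
import torch

def tvd_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor) -> torch.Tensor:
    # Total Variation Distance between two categorical distributions:
    # TVD(P, Q) = 0.5 * sum_i |P_i - Q_i|, averaged over the batch.
    p = torch.softmax(teacher_logits, dim=-1)
    q = torch.softmax(student_logits, dim=-1)
    return 0.5 * (p - q).abs().sum(dim=-1).mean()
```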
### 🚀 The feature, motivation and pitch We've already implemented KL divergence and JSD loss. Thanks to the community! This feature request is to: add an optional feature for ignoring...
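A minimal sketch of what an ignore-index option for such a distillation loss could look like; the function name, signature, and the `ignore_index=-100` default are assumptions for illustration, not the library's API:

```py
import torch
import torch.nn.functional as F

def kl_div_with_ignore_index(student_logits, teacher_logits, labels, ignore_index=-100):
    # Per-token forward KL(teacher || student); positions whose label equals
    # ignore_index (e.g. padding or prompt tokens) are masked out of the mean.
    log_q = F.log_softmax(student_logits, dim=-1)
    log_p = F.log_softmax(teacher_logits, dim=-1)
    per_token = F.kl_div(log_q, log_p, reduction="none", log_target=True).sum(dim=-1)
    mask = labels != ignore_index
    return (per_token * mask).sum() / mask.sum().clamp(min=1)
```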
### 🚀 The feature, motivation and pitch FP8 training has been a great weapon on H100, provides huge memory and speed benefits, and has been shown to be effective (with...
### 🚀 The feature, motivation and pitch W8A8 (int8 for both weights and activations) matmul is well suited to A100 and could provide significant memory and speed benefits, and could be...
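As a rough illustration of the W8A8 scheme mentioned above, a sketch using symmetric per-tensor int8 quantization; the function name is hypothetical and the integer matmul is only emulated here, whereas a real kernel would dispatch to an int8 tensor-core GEMM with int32 accumulation:

```py
import torch

def w8a8_linear(x: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    # Symmetric per-tensor int8 quantization of activations and weights,
    # integer matmul, then dequantization back to the input dtype.
    x_scale = x.abs().amax().clamp(min=1e-8) / 127.0
    w_scale = weight.abs().amax().clamp(min=1e-8) / 127.0
    x_q = (x / x_scale).round().clamp(-128, 127).to(torch.int8)
    w_q = (weight / w_scale).round().clamp(-128, 127).to(torch.int8)
    # Emulated wide-integer accumulation (a real int8 GEMM accumulates in int32).
    acc = x_q.long() @ w_q.t().long()
    return acc.to(x.dtype) * (x_scale * w_scale)
```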
### 🚀 The feature, motivation and pitch Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention https://arxiv.org/abs/2502.11089 Potentially useful python reference https://github.com/dhcode-cpp/NSA-pytorch ### Alternatives _No response_ ### Additional context _No...