Qingquan Song
## 🐛 Bug version 1.3.1 1) Similar issue as: #531 when running the following code: ```py class MyModel(LightningModule): def __init__(self): self.metrics: ModuleDict[str, MetricCollection] = ModuleDict(...
Are we still able to get "thrift_library.bzl", "python_library.bzl", "python_unittest.bzl" as well as other bzl files? Thanks!
Hey Team, We're running some experiments with Mistral 7B ORPO and variants, but found that using GPT-4-1106-preview as the baseline + OpenAI GPT-4 judgement produces overly high results: ``` INFO:root:Not saving...
Hey team, AO provides awesome FP8 support with torch compile to get speed and memory improvements. However, since torch compile is not always easily applicable for some models such as...
Hey Team, I'm trying to use FSDP1/2 with Float8InferenceLinear but it seems to have some issues (with torch 2.3.1+cu118). Do you suggest bumping to a higher version of torch and have a...
### 🚀 The feature, motivation and pitch TVD is a good distance metric ([ref](https://aclanthology.org/2023.acl-long.605.pdf)) with an easy-to-implement kernel, and it makes the gradient more stable compared to KL divergence and...
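For reference, a minimal unfused PyTorch sketch of the TVD loss described above, assuming both inputs are logits (the function name and signature are illustrative, not an existing API); a fused kernel would compute the same quantity:

```py
import torch

def tvd_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor) -> torch.Tensor:
    # Total Variation Distance between two categorical distributions:
    # TVD(P, Q) = 0.5 * sum_i |P_i - Q_i|, averaged over the batch.
    p = torch.softmax(teacher_logits, dim=-1)
    q = torch.softmax(student_logits, dim=-1)
    return 0.5 * (p - q).abs().sum(dim=-1).mean()
```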
### 🚀 The feature, motivation and pitch We've already implemented KL divergence and JSD loss. Thanks to the community! This feature request is to: add an optional feature for ignoring...
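A minimal sketch of what an ignore-index option for such a distillation loss could look like; the function name, signature, and the `ignore_index=-100` default are assumptions for illustration, not the library's API:

```py
import torch
import torch.nn.functional as F

def kl_div_with_ignore_index(student_logits, teacher_logits, labels, ignore_index=-100):
    # Per-token forward KL(teacher || student); positions whose label equals
    # ignore_index (e.g. padding or prompt tokens) are masked out of the mean.
    log_q = F.log_softmax(student_logits, dim=-1)
    log_p = F.log_softmax(teacher_logits, dim=-1)
    per_token = F.kl_div(log_q, log_p, reduction="none", log_target=True).sum(dim=-1)
    mask = labels != ignore_index
    return (per_token * mask).sum() / mask.sum().clamp(min=1)
```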
### 🚀 The feature, motivation and pitch FP8 training has been a great weapon on H100, provides huge memory and speed benefits, and has been shown to be effective (with...
### 🚀 The feature, motivation and pitch W8A8 (int8 for both weights and activations) matmul is well suited to A100 and could provide significant memory and speed benefits, and could be...
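As a rough illustration of the W8A8 scheme mentioned above, a sketch using symmetric per-tensor int8 quantization; the function name is hypothetical and the integer matmul is only emulated here, whereas a real kernel would dispatch to an int8 tensor-core GEMM with int32 accumulation:

```py
import torch

def w8a8_linear(x: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    # Symmetric per-tensor int8 quantization of activations and weights,
    # integer matmul, then dequantization back to the input dtype.
    x_scale = x.abs().amax().clamp(min=1e-8) / 127.0
    w_scale = weight.abs().amax().clamp(min=1e-8) / 127.0
    x_q = (x / x_scale).round().clamp(-128, 127).to(torch.int8)
    w_q = (weight / w_scale).round().clamp(-128, 127).to(torch.int8)
    # Emulated wide-integer accumulation (a real int8 GEMM accumulates in int32).
    acc = x_q.long() @ w_q.t().long()
    return acc.to(x.dtype) * (x_scale * w_scale)
```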
### 🚀 The feature, motivation and pitch Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention https://arxiv.org/abs/2502.11089 Potentially useful python reference https://github.com/dhcode-cpp/NSA-pytorch ### Alternatives _No response_ ### Additional context _No...