B. Shen

Results 8 issues of B. Shen

The `soft_cross_entropy` loss function in [TinyBERT](https://github.com/huawei-noah/Pretrained-Language-Model/blob/master/TinyBERT/task_distill.py#L905), [DynaBERT](https://github.com/huawei-noah/Pretrained-Language-Model/blob/master/DynaBERT/run_glue.py#L51), and other distilled models appears to be inaccurate. The paper defines it as $CE(z^T/t, z^S/t)$, but the code is not functioning as...
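For reference, here is a minimal NumPy sketch of one common reading of $CE(z^T/t, z^S/t)$: both sets of logits are divided by the temperature, the teacher logits are turned into probabilities with a softmax, and the student logits into log-probabilities. This reading is an assumption, not the repository's code, and the function and argument names are illustrative.

```python
import numpy as np

def log_softmax(logits, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def soft_cross_entropy(student_logits, teacher_logits, temperature=1.0):
    # Assumed reading of CE(z^T / t, z^S / t): temperature-scale both logits,
    # softmax the teacher, log-softmax the student, then average over the batch.
    teacher_probs = np.exp(log_softmax(teacher_logits / temperature))
    student_log_probs = log_softmax(student_logits / temperature)
    return float(-(teacher_probs * student_log_probs).sum(axis=-1).mean())

teacher = np.array([[2.0, 0.0, -1.0]])
student = np.array([[0.0, 1.0, 0.5]])
loss = soft_cross_entropy(student, teacher, temperature=2.0)
```

Under this reading, the loss is smallest when the student logits match the teacher logits, in which case it equals the entropy of the softened teacher distribution.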

## 🐛 Bug Report

Some problems have mismatched tags.

## To Reproduce

Open certain problems with the vscode-leetcode extension, e.g., **1011**.

## Expected behavior

`Array`, `Binary Search` :heavy_check_mark: Both...

Markmap saves me a lot of time, and I think it has the potential to replace the TOC 😆 However, we can jump to the position of the text by...

`init_highway_pooler` should be called only before training, not before evaluation.

The arguments define an attribute named `do_distill` https://github.com/princeton-nlp/CoFiPruning/blob/793e3e1291827e2714b5de6d5c0b6b04bc1863e4/args.py#L43 while `run_qa_prune.py` uses `do_distil` (missing an 'l') https://github.com/princeton-nlp/CoFiPruning/blob/793e3e1291827e2714b5de6d5c0b6b04bc1863e4/run_qa_prune.py#L101
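To illustrate the kind of failure such a misspelling can cause, here is a small sketch using a hypothetical stand-in dataclass. `AdditionalArguments` below is not the repository's class, and how the real script behaves depends on how its arguments object is built.

```python
from dataclasses import dataclass

@dataclass
class AdditionalArguments:
    # Hypothetical stand-in: only the correctly spelled attribute is defined.
    do_distill: bool = False

args = AdditionalArguments(do_distill=True)

try:
    args.do_distil  # misspelled: missing the final 'l'
    typo_raises = False
except AttributeError:
    # A plain attribute lookup on an undefined name raises AttributeError.
    typo_raises = True
```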

## 🐛 Bug

The issue with the paged KV cache under a specific head_dim has been fixed for the CUDA target, but there are still problems with the OpenCL target after...

bug

### Feature request

Token averaging in gradient accumulation was fixed in #34191. However, token averaging in DDP seems to have the same issue.

---

## Expected behavior

With all...

Feature request
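One way to read the DDP concern in this request: if each rank averages the loss over its own tokens and DDP then averages across ranks, the result is a mean of per-rank means, which differs from the mean over all tokens whenever ranks hold different token counts. The sketch below uses made-up per-token losses and plain Python in place of real distributed calls. The rescaling step is one possible correction under these assumptions, not necessarily the fix the issue proposes.

```python
def mean(values):
    return sum(values) / len(values)

# Hypothetical per-token losses on two ranks with unequal token counts.
rank_token_losses = [[1.0, 1.0, 1.0, 1.0], [4.0]]
world_size = len(rank_token_losses)

# Each rank averages over its own tokens, then the ranks are averaged.
mean_of_means = mean([mean(losses) for losses in rank_token_losses])

# The target quantity: one average over every token on every rank.
total_tokens = sum(len(losses) for losses in rank_token_losses)
global_token_mean = sum(sum(losses) for losses in rank_token_losses) / total_tokens

# Possible correction: each rank scales its summed loss by
# world_size / total_tokens, so the cross-rank average recovers the global mean.
rescaled = [sum(losses) * world_size / total_tokens for losses in rank_token_losses]
average_of_rescaled = mean(rescaled)
```

In a real setup, `total_tokens` would have to be gathered across ranks (for example with an all-reduce) before the loss is scaled.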

Why do we force `recompute_granularity = 'selective'` when using `recompute_activations`? I notice there are two choices, `full` and `selective`. Is `selective` the recommended choice? https://github.com/NVIDIA/Megatron-LM/blob/b428f80cd576f0e6a3b526c010c5b6014da69f7e/megatron/training/arguments.py#L292-L294

stale