Norman Mu

10 comments by Norman Mu

What environment should I use? The environment in the comment you linked differs from what was suggested in the README of this repo. I don't think the torch/d2 versions are the...

Did anyone here resolve this issue?

I'm storing the textual metadata in a JSON field. Here is a quick tour of how the tokenizers work: https://huggingface.co/docs/transformers/preprocessing. They expect strings as input and output dictionaries of int arrays...
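As a minimal sketch of that interface (the checkpoint name is arbitrary; any pretrained tokenizer behaves the same way):

```python
from transformers import AutoTokenizer

# Any pretrained tokenizer works here; the checkpoint name is just an example.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Strings (or lists of strings) go in...
encoded = tokenizer(["a photo of a dog", "a photo of a cat"], padding=True)

# ...and a dict of integer lists comes out.
print(encoded.keys())   # dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])
print(encoded["input_ids"])
```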

Not always, but often; the token type IDs are more useful for specific NLP tasks. For my use case (replacing [this](https://github.com/facebookresearch/SLIP/blob/main/datasets.py#L100) dataset for CLIP training) we need the attention mask...
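Roughly what I have in mind, sketched with an assumed tokenizer and field names (not the actual SLIP code), showing only the text side:

```python
from torch.utils.data import Dataset

class ClipTextDataset(Dataset):
    """Hypothetical caption dataset: tokenizes strings and keeps the attention mask."""

    def __init__(self, captions, tokenizer, max_length=77):
        self.captions = captions
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.captions)

    def __getitem__(self, idx):
        enc = self.tokenizer(
            self.captions[idx],
            padding="max_length",
            max_length=self.max_length,
            truncation=True,
            return_tensors="pt",
        )
        # input_ids feed the text encoder; attention_mask marks real vs. padded tokens.
        return enc["input_ids"].squeeze(0), enc["attention_mask"].squeeze(0)
```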

I tried validating the masking implementation with lm-eval-harness. ~~On HellaSwag, mamba-1.4b with right padding still achieves the reported 59.1% accuracy. Switching to left padding drops this to 55.8% accuracy, and...

@tridao do you think it would be feasible to implement masking by setting padded timesteps of the discretized A and B matrices to identity operators (i.e. all 1's for A...
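To make the idea concrete, here is a reference-style scan (plain PyTorch, not the fused kernel; shapes and names are my assumptions) where padded steps get an all-ones A and a zeroed B contribution, so the state just carries through:

```python
import torch

def masked_selective_scan(deltaA, deltaB_u, C, mask):
    """
    Naive scan for illustration only.
    deltaA:   (batch, length, d_inner, d_state)  discretized A
    deltaB_u: (batch, length, d_inner, d_state)  discretized B * u
    C:        (batch, length, d_state)
    mask:     (batch, length), 1 for real tokens, 0 for padding
    """
    b, l, d, n = deltaA.shape
    m = mask[:, :, None, None]
    deltaA = torch.where(m.bool(), deltaA, torch.ones_like(deltaA))  # identity at pads
    deltaB_u = deltaB_u * m                                          # no input at pads

    state = torch.zeros(b, d, n, dtype=deltaA.dtype, device=deltaA.device)
    ys = []
    for t in range(l):
        # At a padded step this reduces to state = state (passes through unchanged).
        state = deltaA[:, t] * state + deltaB_u[:, t]
        ys.append(torch.einsum("bdn,bn->bd", state, C[:, t]))
    return torch.stack(ys, dim=1)  # (batch, length, d_inner)
```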

Thanks, that makes sense. I didn't realize that `deltaB_u` was a linear transformation of `x`. I guess this approach doesn't technically handle internal pad tokens correctly, but it works for...

I tested this out in the slow path of `Mamba.forward` by masking twice (once before the causal conv1d and once before the selective scan):

```diff
class Mamba(nn.Module):
    ...
    def forward(self, ...
```
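For reference, the gist of what that masking looks like, written as a standalone sketch rather than the actual diff (the layer objects are passed in as stand-ins, and all names are assumed):

```python
import torch.nn.functional as F

def masked_mamba_mixer(x, mask, conv1d, selective_scan):
    """
    Illustrative only: x is (batch, length, d_inner), mask is (batch, length)
    with 0 at pad positions; conv1d and selective_scan stand in for the real layers.
    """
    seqlen = x.shape[1]

    if mask is not None:
        x = x * mask[..., None]          # 1) zero pads before the causal conv1d

    x = conv1d(x.transpose(1, 2))[..., :seqlen].transpose(1, 2)
    x = F.silu(x)

    if mask is not None:
        x = x * mask[..., None]          # 2) zero pads again before the selective scan,
                                         #    so deltaB_u (linear in x) vanishes at pads

    return selective_scan(x)
```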

It seems like the [PyTorch attention implementation](https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html) supports custom attention masks and also uses Flash-Attention 2: https://twitter.com/StasBekman/status/1736083447658225665. Though I'm not sure that passing in an attention mask doesn't cause the...
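For reference, a minimal call with a boolean mask (shapes are arbitrary; which backend actually gets picked when a mask is passed is exactly the part I'm unsure about):

```python
import torch
import torch.nn.functional as F

batch, heads, seq, head_dim = 2, 8, 16, 64
q = torch.randn(batch, heads, seq, head_dim)
k = torch.randn(batch, heads, seq, head_dim)
v = torch.randn(batch, heads, seq, head_dim)

# Boolean attention mask: True = attend, False = ignore
# (here the last 4 key positions are masked out, e.g. right padding).
attn_mask = torch.ones(batch, 1, seq, seq, dtype=torch.bool)
attn_mask[..., -4:] = False

out = F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask)
print(out.shape)  # torch.Size([2, 8, 16, 64])
```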