Songlin Yang

Results 46 comments of Songlin Yang

I am pretty sure the cause is the "Chart" class: set cache=False if you want to reuse the computation graph.

Is there a convenient way to set the initial state for Mamba? I want to use truncated backpropagation through time (TBPTT) to train Mamba on a longer context size, so there is no need to...

Mine is normal. NVIDIA A100 80GB PCIe, Triton nightly release ``` Testing BFloat16... tensor([[ -3.0000, 4.3125, -5.2812, ..., -3.1094, 4.4062, 1.4141], [ -4.1875, 8.4375, 7.3750, ..., -4.2188, 0.7227, 4.2188], [...

Sorry, we don't have support for this. I think it would be relatively easy to implement with recursion.

I am afraid that it does not work for CFGs. logbmm takes only two tensors, and each tensor must be 3-dimensional. For example, in compound PCFG, we have...
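To make the constraint concrete, here is a minimal sketch of what a log-space batched matrix multiply over two 3-D operands computes. It assumes `logbmm` means the usual log-semiring product, out[b][i][j] = logsumexp over k of (x[b][i][k] + y[b][k][j]). The function names `logsumexp` and `log_bmm` are hypothetical, written in pure Python on nested lists, and are not the library's actual kernel or API.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    if m == float("-inf"):
        return m  # every term is log(0)
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_bmm(x, y):
    """Hypothetical reference for a log-semiring batched matmul.

    x: nested list of shape [batch][n][k]
    y: nested list of shape [batch][k][m]
    returns: nested list of shape [batch][n][m]
    """
    out = []
    for xb, yb in zip(x, y):
        k_dim, m_dim = len(yb), len(yb[0])
        out.append([
            [logsumexp([row[k] + yb[k][j] for k in range(k_dim)])
             for j in range(m_dim)]
            for row in xb
        ])
    return out


# Usage: exponentiating the result recovers an ordinary matrix product.
x = [[[math.log(1.0), math.log(2.0)], [math.log(3.0), math.log(4.0)]]]
y = [[[math.log(5.0), math.log(6.0)], [math.log(7.0), math.log(8.0)]]]
result = log_bmm(x, y)
```

Both operands have exactly three axes (batch, rows, columns), which is the two-tensor, 3-D contract the comment refers to.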

I found a slightly better way to reduce O(batch, n-w, w, A, B, C) to O(batch, n-w, A, B, C). Instead of combining B and C first, we can combine...

No, it is not an issue for dependency parsing, since dependency parsing does not have "non-terminals". Dependency parsing can be regarded as lexicalized CFGs whose non-terminals are Null for dependency...

By the way, I found that PyTorch's autograd uses a large amount of GPU memory to compute gradients. If I use a linear scan to explicitly implement the outside algorithm and use the inside-outside algorithm to...