Jay Zhuang

34 issues of Jay Zhuang

Nice work! In the paper I saw the batched results, but examples like https://github.com/Infini-AI-Lab/TriForce/blob/main/test/on_chip.py only use batch size = 1. Does the code support batched speculative inference?
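For context, a minimal sketch of why batch size > 1 complicates speculative decoding. All names below are hypothetical and unrelated to TriForce's actual API; the point is that each batch row can accept a different number of drafted tokens, so verification becomes ragged.

```python
# Toy verification step of speculative decoding, batched. Each row accepts the
# longest prefix where the draft model's tokens match the target model's tokens.
def verify_batched(draft_tokens, target_tokens):
    """draft_tokens / target_tokens: one token list per batch row.

    Returns the accepted token prefix for each row.
    """
    accepted = []
    for draft_row, target_row in zip(draft_tokens, target_tokens):
        row = []
        for d, t in zip(draft_row, target_row):
            if d != t:
                break  # first mismatch ends this row's accepted prefix
            row.append(d)
        accepted.append(row)
    return accepted

# Rows accept different numbers of tokens, so KV-cache lengths and position
# ids diverge per row -- bookkeeping that batch-size-1 code never needs.
print(verify_batched([[1, 2, 3], [4, 5, 6]],
                     [[1, 2, 9], [4, 7, 6]]))  # → [[1, 2], [4]]
```

This ragged acceptance is the main reason "just loop over the batch" is not a free generalization of the batch-size-1 path.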


In order to visualize some Maxwell simulation results (without writing to VTK), I hacked together some code to plot a vector field with zeroth-order Nédélec elements on a quadrilateral mesh. Would it be...
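As a hedged sketch of what such plotting involves: evaluate the edge-element expansion pointwise, then hand the sampled vectors to a quiver plot. The basis below uses one common convention for the lowest-order Nédélec (first kind) basis on the reference quad [0,1]²; the edge ordering and signs are assumptions, not tied to any particular FEM library.

```python
import numpy as np

def nedelec0_quad_basis(x, y):
    """Four edge basis functions on the reference quad [0,1]^2.

    Each has unit tangential component on its own edge and zero on the others.
    Ordering (an assumption): bottom, top, left, right.
    """
    return np.array([
        [1.0 - y, 0.0],   # bottom edge (y = 0)
        [y,       0.0],   # top edge    (y = 1)
        [0.0, 1.0 - x],   # left edge   (x = 0)
        [0.0, x],         # right edge  (x = 1)
    ])

def eval_field(dofs, x, y):
    """Interpolate u(x, y) = sum_i dofs[i] * N_i(x, y) at one point."""
    return (np.asarray(dofs)[:, None] * nedelec0_quad_basis(x, y)).sum(axis=0)

# Sampling eval_field on a grid of (x, y) points gives (u, v) arrays that can
# be passed directly to matplotlib's plt.quiver(x, y, u, v).
u = eval_field([1.0, 0.0, 0.0, 0.0], 0.5, 0.25)  # → [0.75, 0.0]
```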

Here `lookahead_generation` doesn't take `logits_warper` as input: https://github.com/alipay/PainlessInferenceAcceleration/blob/8015f12f7fe32acc102bb3eb51c4f8b3a420e79c/pia/lookahead/common/pretrained_model_batch.py#L426-L439 `logits_warper` is used in the original `sample` to modify `next_tokens_scores`: https://github.com/alipay/PainlessInferenceAcceleration/blob/8015f12f7fe32acc102bb3eb51c4f8b3a420e79c/pia/lookahead/common/pretrained_model_batch.py#L474-L486 and to modify the logits by temperature, top_k, top_p, etc.:

```python
if generation_config.temperature is...
```
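For readers unfamiliar with logits warpers, here is a minimal numpy sketch of what temperature and top-k warping do to `next_tokens_scores` before sampling. This mirrors the spirit of Hugging Face's warpers, not the repo's actual code.

```python
import numpy as np

def warp_logits(scores, temperature=1.0, top_k=0):
    """Apply temperature scaling, then top-k filtering, to a 1-D score vector."""
    scores = np.asarray(scores, dtype=float) / temperature
    if top_k > 0:
        # keep the top_k highest scores; send everything else to -inf so that
        # softmax assigns those tokens zero probability
        kth = np.sort(scores)[-top_k]
        scores = np.where(scores < kth, -np.inf, scores)
    return scores

warp_logits([1.0, 2.0, 3.0], top_k=1)          # → [-inf, -inf, 3.0]
warp_logits([2.0, 4.0], temperature=2.0)        # → [1.0, 2.0]
```

If `lookahead_generation` skips this step, sampled outputs will effectively behave as if temperature = 1 with no top-k/top-p filtering, regardless of the `generation_config`.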

I attempted to swap in FlashAttention for the batched llama model by simply changing `self._attn()` to `self._sdp_attn()` inside `LlamaAttention.forward()`: https://github.com/alipay/PainlessInferenceAcceleration/blob/6280cb2f097ba0bc6bc423ab910b9de7ddbe3bf2/pia/lookahead/models/llama/modeling_llama_batch.py#L372-L375 https://github.com/alipay/PainlessInferenceAcceleration/blob/6280cb2f097ba0bc6bc423ab910b9de7ddbe3bf2/pia/lookahead/models/llama/modeling_llama_batch.py#L404-L407 where `_sdp_attn` is defined as: https://github.com/alipay/PainlessInferenceAcceleration/blob/6280cb2f097ba0bc6bc423ab910b9de7ddbe3bf2/pia/lookahead/models/llama/modeling_llama_batch.py#L327-L329 However, the model generates wrong results. ...
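One plausible failure mode (an assumption, not a confirmed diagnosis) is that the swapped-in attention drops the lookahead decoding mask that `_attn` was applying. A reference implementation makes it easy to see that the output depends on the mask:

```python
import numpy as np

def sdpa(q, k, v, mask=None):
    """Reference scaled-dot-product attention; mask is a boolean keep-matrix."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)  # masked positions get -inf
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

q = k = v = np.eye(2)
causal = np.tril(np.ones((2, 2), dtype=bool))

out_masked = sdpa(q, k, v, causal)  # row 0 attends only to key 0 → [1, 0]
out_plain = sdpa(q, k, v)           # row 0 mixes both keys → different output
```

Since lookahead decoding verifies several candidate branches in one forward pass, its mask is neither empty nor plain causal; an attention call that only supports those two modes would silently produce wrong logits.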