KexinFeng

Results: 10 issues by KexinFeng

### 📚 The doc issue Hi, I'm trying the tutorial example of torch::deploy, aiming to package a model and do inference in C++. But I ran into a problem...

triaged
module: deploy
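
For context on the workflow that doc issue refers to, here is a minimal sketch of the Python packaging half, assuming only the standard `torch.package` API; the tutorial then loads the resulting archive on the C++ side with `torch::deploy`. The file name `tiny_model.pt` and the toy model are illustrative.

```python
import torch
from torch.package import PackageExporter, PackageImporter

# A model built only from built-in torch modules, so no custom source
# needs to be interned into the package.
model = torch.nn.Sequential(torch.nn.Linear(4, 2), torch.nn.ReLU())

with PackageExporter("tiny_model.pt") as exporter:
    # torch is provided by the loading environment, so mark it extern.
    exporter.extern(["torch", "torch.**"])
    exporter.save_pickle("model", "model.pkl", model)

# Round-trip check in Python; torch::deploy loads the same archive and
# the same ("model", "model.pkl") pickle from C++.
loaded = PackageImporter("tiny_model.pt").load_pickle("model", "model.pkl")
print(loaded(torch.randn(1, 4)))
```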

## Description This is the same as https://github.com/apache/incubator-mxnet/pull/20559. The PR adds support for fetching the gradients of intermediate variables in a hybridized Gluon block. This applies uniformly to...

pr-awaiting-testing
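
As a baseline for what the PR extends, here is a minimal imperative Gluon sketch that fetches the gradient of an input via `attach_grad()`; per the description above, the PR adds the analogous ability for intermediate variables inside a hybridized block. The toy network is illustrative.

```python
from mxnet import autograd, gluon, nd

net = gluon.nn.Dense(2)
net.initialize()

x = nd.random.uniform(shape=(3, 4))
x.attach_grad()              # request that x's gradient be stored after backward()

with autograd.record():
    y = net(x)               # input gradients also work after net.hybridize()
    loss = y.sum()
loss.backward()

print(x.grad)                # d(loss)/d(x); the PR targets gradients of
                             # intermediate outputs inside the hybridized graph
```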

## Description This is shown in PR https://github.com/deepjavalibrary/djl-serving/pull/909. In the assertion in the unit test there, the output of the sampling algorithm is different between testing it locally and testing it...

bug
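
A generic illustration (not the DJL test itself) of why a sampling-based assertion can pass locally and fail elsewhere: the drawn token depends on the RNG seed, device, and library versions, so exact-match checks usually need a pinned generator. The logits below are toy values.

```python
import torch

logits = torch.tensor([[2.0, 1.0, 0.5, 0.1]])   # toy next-token logits
probs = torch.softmax(logits, dim=-1)

# Without pinning the RNG, the sampled token can change between runs and
# between machines, which is enough to break an exact-match assertion.
gen = torch.Generator().manual_seed(1234)
token = torch.multinomial(probs, num_samples=1, generator=gen)
print(token.item())
```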

I notice in the introduction that

> torch::deploy (MultiPy for non-PyTorch use cases) is a C++ library that enables you to run eager mode PyTorch models in production without any...

### 📚 The doc issue Hi, I'm trying the tutorial example of torch::deploy, aiming to package a model and do inference in C++. But I ran into a problem...

Hi, if I understand the tree_search algorithm correctly, the dynamic-programming process should be able to find the optimal number of generated tokens according to the acceptance-rate vector. Also, given the...
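
For the single-chain special case, the quantity being optimized can be sketched as follows: with acceptance rates `p_1..p_K`, drafting `k` tokens yields an expected `sum_{i<=k} prod_{j<=i} p_j` accepted tokens, and the best `k` trades that against drafting cost. This is only an illustration of the idea; the repo's `tree_search` generalizes it to trees via dynamic programming, and `cost_per_draft_token` below is a hypothetical knob.

```python
import numpy as np

def expected_accepted(acc, k):
    """Expected number of accepted draft tokens when speculating k tokens along
    a single chain, with acc[i] the acceptance rate at draft position i."""
    return float(np.cumprod(np.asarray(acc[:k], dtype=float)).sum())

def best_draft_length(acc, cost_per_draft_token=0.05):
    """Choose the draft length maximizing expected accepted tokens minus a
    (hypothetical) per-token drafting cost."""
    scores = [expected_accepted(acc, k) - cost_per_draft_token * k
              for k in range(1, len(acc) + 1)]
    return 1 + int(np.argmax(scores))

acceptance_rate_vector = [0.9, 0.8, 0.7, 0.5, 0.3]   # toy values
print(best_draft_length(acceptance_rate_vector))
```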

Hi, I was trying to reproduce the numbers in the paper, but with `demo-config.json`, together with either the acceptance vector in the repo or the acceptance vector I measured myself, the...

Hi, I remember that vLLM support was on your TODO list. Have you achieved it yet? Was the main challenge in this direction that the batch size > 1 tree...

The tree attention mask is already supported in huggingface/transformers: https://github.com/huggingface/transformers/pull/27539. It would be very helpful for speculative-decoding applications. More specifically, in `flash_attn/flash_attn_interface.py#flash_attn_with_kvcache`, the tree attention mask will need to...
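
For reference, a sketch of what such a mask encodes, assuming a simple parent-index representation of the token tree: each drafted token may attend to itself and its ancestors (on top of the cached prefix). How this mask would be threaded into `flash_attn_with_kvcache` is exactly what the issue asks about.

```python
import torch

def tree_attention_mask(parents):
    """Boolean attention mask for a token tree: position i may attend to
    position j iff j == i or j is an ancestor of i (parents[root] == -1)."""
    n = len(parents)
    mask = torch.zeros(n, n, dtype=torch.bool)
    for i in range(n):
        j = i
        while j != -1:
            mask[i, j] = True
            j = parents[j]
    return mask

# Toy tree: node 0 is the root, nodes 1 and 2 branch from 0, node 3 extends node 1.
print(tree_attention_mask([-1, 0, 0, 1]).int())
# tensor([[1, 0, 0, 0],
#         [1, 1, 0, 0],
#         [1, 0, 1, 0],
#         [1, 1, 0, 1]], dtype=torch.int32)
```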

I have a question that arose while reading the code. I notice that in `~/lightllm/models/llama2/layer_infer/transformer_layer_infer.py`, flash attention is only applied in the prefill stage, i.e. `context_attention_fwd`, but not to...

bug
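
A shape-level illustration (not lightllm's actual kernels) of why prefill and decode usually take different attention paths: prefill is a dense causal pass over the whole prompt, while decode is a single query against the KV cache. PyTorch's `scaled_dot_product_attention` is used here just to show the two workloads.

```python
import torch
import torch.nn.functional as F

head_dim, prompt_len = 64, 16

# Prefill: every prompt token attends to its prefix under a causal mask;
# this dense stage is where a fused FlashAttention-style kernel helps most.
q = torch.randn(1, 1, prompt_len, head_dim)
k = torch.randn(1, 1, prompt_len, head_dim)
v = torch.randn(1, 1, prompt_len, head_dim)
prefill_out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Decode: a single new query attends to the whole KV cache; the workload is a
# skinny (1 x cache_len) product, so engines often use a separate kernel here.
new_q = torch.randn(1, 1, 1, head_dim)
cache_k = torch.cat([k, torch.randn(1, 1, 1, head_dim)], dim=2)
cache_v = torch.cat([v, torch.randn(1, 1, 1, head_dim)], dim=2)
decode_out = F.scaled_dot_product_attention(new_q, cache_k, cache_v)

print(prefill_out.shape, decode_out.shape)   # (1,1,16,64) and (1,1,1,64)
```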