KexinFeng

Results: 10 issues by KexinFeng

### 📚 The doc issue Hi, I'm trying the tutorial example of torch::deploy, aiming to package a model and do inference in C++. But I ran into a problem...

triaged
module: deploy
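
For context on the workflow that doc issue refers to, here is a minimal sketch of the Python packaging half, assuming only the standard `torch.package` API; the tutorial then loads the resulting archive on the C++ side with `torch::deploy`. The file name `tiny_model.pt` and the toy model are illustrative.

```python
import torch
from torch.package import PackageExporter, PackageImporter

# A model built only from built-in torch modules, so no custom source
# needs to be interned into the package.
model = torch.nn.Sequential(torch.nn.Linear(4, 2), torch.nn.ReLU())

with PackageExporter("tiny_model.pt") as exporter:
    # torch is provided by the loading environment, so mark it extern.
    exporter.extern(["torch", "torch.**"])
    exporter.save_pickle("model", "model.pkl", model)

# Round-trip check in Python; torch::deploy loads the same archive and
# the same ("model", "model.pkl") pickle from C++.
loaded = PackageImporter("tiny_model.pt").load_pickle("model", "model.pkl")
print(loaded(torch.randn(1, 4)))
```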

## Description This is the same as https://github.com/apache/incubator-mxnet/pull/20559. The PR adds support for fetching the gradients of intermediate variables in a hybridized Gluon block. This applies uniformly to...

pr-awaiting-testing
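
As a baseline for what the PR extends, here is a minimal imperative Gluon sketch that fetches the gradient of an input via `attach_grad()`; per the description above, the PR adds the analogous ability for intermediate variables inside a hybridized block. The toy network is illustrative.

```python
from mxnet import autograd, gluon, nd

net = gluon.nn.Dense(2)
net.initialize()

x = nd.random.uniform(shape=(3, 4))
x.attach_grad()              # request that x's gradient be stored after backward()

with autograd.record():
    y = net(x)               # input gradients also work after net.hybridize()
    loss = y.sum()
loss.backward()

print(x.grad)                # d(loss)/d(x); the PR targets gradients of
                             # intermediate outputs inside the hybridized graph
```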

## Description This is shown in PR https://github.com/deepjavalibrary/djl-serving/pull/909. In the assertion in the unit test there, the output of the sampling algorithm is different between testing it locally and testing it...

bug
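
A generic illustration (not the DJL test itself) of why a sampling-based assertion can pass locally and fail elsewhere: the drawn token depends on the RNG seed, device, and library versions, so exact-match checks usually need a pinned generator. The logits below are toy values.

```python
import torch

logits = torch.tensor([[2.0, 1.0, 0.5, 0.1]])   # toy next-token logits
probs = torch.softmax(logits, dim=-1)

# Without pinning the RNG, the sampled token can change between runs and
# between machines, which is enough to break an exact-match assertion.
gen = torch.Generator().manual_seed(1234)
token = torch.multinomial(probs, num_samples=1, generator=gen)
print(token.item())
```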

I notice in the introduction that

> torch::deploy (MultiPy for non-PyTorch use cases) is a C++ library that enables you to run eager mode PyTorch models in production without any...

### 📚 The doc issue Hi, I'm trying the tutorial example of torch::deploy, aiming to package a model and do inference in C++. But I ran into a problem...

Hi, if I understand the tree_search algorithm correctly, the dynamic-programming process should be able to find the optimal number of generated tokens according to the acceptance-rate vector. Also, given the...
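
For the single-chain special case, the quantity being optimized can be sketched as follows: with acceptance rates `p_1..p_K`, drafting `k` tokens yields an expected `sum_{i<=k} prod_{j<=i} p_j` accepted tokens, and the best `k` trades that against drafting cost. This is only an illustration of the idea; the repo's `tree_search` generalizes it to trees via dynamic programming, and `cost_per_draft_token` below is a hypothetical knob.

```python
import numpy as np

def expected_accepted(acc, k):
    """Expected number of accepted draft tokens when speculating k tokens along
    a single chain, with acc[i] the acceptance rate at draft position i."""
    return float(np.cumprod(np.asarray(acc[:k], dtype=float)).sum())

def best_draft_length(acc, cost_per_draft_token=0.05):
    """Choose the draft length maximizing expected accepted tokens minus a
    (hypothetical) per-token drafting cost."""
    scores = [expected_accepted(acc, k) - cost_per_draft_token * k
              for k in range(1, len(acc) + 1)]
    return 1 + int(np.argmax(scores))

acceptance_rate_vector = [0.9, 0.8, 0.7, 0.5, 0.3]   # toy values
print(best_draft_length(acceptance_rate_vector))
```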

Hi, I was trying to reproduce the numbers in the paper, but with `demo-config.json`, together with either the acceptance vector in the repo or the acceptance vector I measured myself, the...

Hi, I remember that vLLM support was on your TODO list. Have you achieved it yet? Was the main challenge in this direction that the batch size > 1 tree...

The tree attention mask is already supported in huggingface/transformers: https://github.com/huggingface/transformers/pull/27539. It would be very helpful for speculative-decoding applications. More specifically, in `flash_attn/flash_attn_interface.py#flash_attn_with_kvcache`, the tree attention mask will need to...
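
For reference, a sketch of what such a mask encodes, assuming a simple parent-index representation of the token tree: each drafted token may attend to itself and its ancestors (on top of the cached prefix). How this mask would be threaded into `flash_attn_with_kvcache` is exactly what the issue asks about.

```python
import torch

def tree_attention_mask(parents):
    """Boolean attention mask for a token tree: position i may attend to
    position j iff j == i or j is an ancestor of i (parents[root] == -1)."""
    n = len(parents)
    mask = torch.zeros(n, n, dtype=torch.bool)
    for i in range(n):
        j = i
        while j != -1:
            mask[i, j] = True
            j = parents[j]
    return mask

# Toy tree: node 0 is the root, nodes 1 and 2 branch from 0, node 3 extends node 1.
print(tree_attention_mask([-1, 0, 0, 1]).int())
# tensor([[1, 0, 0, 0],
#         [1, 1, 0, 0],
#         [1, 0, 1, 0],
#         [1, 1, 0, 1]], dtype=torch.int32)
```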

I have a question that arose while reading the code. I notice that in `~/lightllm/models/llama2/layer_infer/transformer_layer_infer.py`, flash attention is only applied in the prefill stage, i.e. `context_attention_fwd`, but not to...

bug
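
A shape-level illustration (not lightllm's actual kernels) of why prefill and decode usually take different attention paths: prefill is a dense causal pass over the whole prompt, while decode is a single query against the KV cache. PyTorch's `scaled_dot_product_attention` is used here just to show the two workloads.

```python
import torch
import torch.nn.functional as F

head_dim, prompt_len = 64, 16

# Prefill: every prompt token attends to its prefix under a causal mask;
# this dense stage is where a fused FlashAttention-style kernel helps most.
q = torch.randn(1, 1, prompt_len, head_dim)
k = torch.randn(1, 1, prompt_len, head_dim)
v = torch.randn(1, 1, prompt_len, head_dim)
prefill_out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Decode: a single new query attends to the whole KV cache; the workload is a
# skinny (1 x cache_len) product, so engines often use a separate kernel here.
new_q = torch.randn(1, 1, 1, head_dim)
cache_k = torch.cat([k, torch.randn(1, 1, 1, head_dim)], dim=2)
cache_v = torch.cat([v, torch.randn(1, 1, 1, head_dim)], dim=2)
decode_out = F.scaled_dot_product_attention(new_q, cache_k, cache_v)

print(prefill_out.shape, decode_out.shape)   # (1,1,16,64) and (1,1,1,64)
```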