TensorRT-LLM
How to get output including context_logits with GPU tensors?
```python
from tensorrt_llm import LLM, SamplingParams

llm = LLM('/app/models/tensorrt_llm', skip_tokenizer_init=True)
sampling_params = SamplingParams(end_id=2, return_context_logits=True, max_new_tokens=1)
results = llm.generate([[32, 12, 24, 54, 6, 747]], sampling_params=sampling_params)
print(results)
print(results[0].context_logits)
```
```
GenerationResult(request_id=1, prompt_token_ids=[32, 12, 24, 54, 6, 747], outputs=[CompletionOutput(index=0, text='', token_ids=[], cumulative_logprob=None, logprobs=[])], finished=False)
tensor([[ -4.7734,  -6.8086,  -2.9629,  ...,  -4.6484,  -5.6211,  -5.0430],
        [  5.9062,   5.9453,   1.4648,  ...,   9.1797,   7.4297,   7.4883],
        [ 10.3906,  13.6094,   9.4766,  ...,  13.9062,  11.4062,  12.7891],
        [  4.1172,   1.7715,  -7.2344,  ...,   2.8203,   5.0391,   2.3750],
        [  1.6025,  -3.4180, -10.7422,  ...,  -2.5332,  -0.2891,  -1.4541],
        [  3.5684,   2.9492,  -0.8101,  ...,   4.0977,   4.3750,   2.9492]])
```
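The tensor above has one row per prompt token (6 here) and one column per vocabulary entry, so the payload grows quickly with prompt length. A rough back-of-the-envelope (the 32,000-entry vocabulary is an assumption; the actual size depends on the model):

```python
# Rough size of a context_logits matrix: prompt_len x vocab_size floats.
prompt_len = 6          # tokens in the prompt above
vocab_size = 32000      # assumed; model-dependent
bytes_per_elem = 4      # float32

size_bytes = prompt_len * vocab_size * bytes_per_elem
print(size_bytes / 1024, "KiB")  # 750.0 KiB for this tiny prompt
```

For a few-thousand-token prompt the same matrix is hundreds of MiB, which is why the device-to-host copy dominates.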
The context_logits tensors are large, and llm.generate becomes very slow because it copies them from the GPU down to the CPU. How can I get the output with context_logits kept as GPU tensors?
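A generic mitigation sketch (not a TensorRT-LLM API, just the underlying idea in plain PyTorch): when you control the device-to-host copy yourself, slice on the GPU first so only the rows you actually need cross the bus. The stand-in tensor and the slicing choice here are illustrative assumptions:

```python
import torch

# Pick the GPU when available; the shapes below hold either way.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Stand-in for a context_logits matrix: [prompt_len, vocab_size].
context_logits = torch.randn(6, 32000, device=device)

# Copy only the last prompt token's logits to the host
# (1/6 of the data for this prompt, far less for long prompts).
last_row = context_logits[-1].to("cpu")
print(last_row.shape)  # torch.Size([32000])
```

This does not change what llm.generate itself returns; it only shows why avoiding the full-matrix host copy is the thing to optimize.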
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.
This issue was closed because it has been stalled for 15 days with no activity.