TensorRT-LLM
How to get output including context_logits with GPU tensors?
```python
from tensorrt_llm import LLM, SamplingParams

llm = LLM('/app/models/tensorrt_llm', skip_tokenizer_init=True)
sampling_params = SamplingParams(end_id=2, return_context_logits=True, max_new_tokens=1)
results = llm.generate([[32, 12, 24, 54, 6, 747]], sampling_params=sampling_params)
print(results)
print(results[0].context_logits)
```
```
GenerationResult(request_id=1, prompt_token_ids=[32, 12, 24, 54, 6, 747], outputs=[CompletionOutput(index=0, text='', token_ids=[], cumulative_logprob=None, logprobs=[])], finished=False)
tensor([[ -4.7734,  -6.8086,  -2.9629,  ...,  -4.6484,  -5.6211,  -5.0430],
        [  5.9062,   5.9453,   1.4648,  ...,   9.1797,   7.4297,   7.4883],
        [ 10.3906,  13.6094,   9.4766,  ...,  13.9062,  11.4062,  12.7891],
        [  4.1172,   1.7715,  -7.2344,  ...,   2.8203,   5.0391,   2.3750],
        [  1.6025,  -3.4180, -10.7422,  ...,  -2.5332,  -0.2891,  -1.4541],
        [  3.5684,   2.9492,  -0.8101,  ...,   4.0977,   4.3750,   2.9492]])
```
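The tensor above has one row per prompt token (6 here) and one column per vocabulary entry, so the payload grows quickly with prompt length. A rough back-of-the-envelope (the 32,000-entry vocabulary is an assumption; the actual size depends on the model):

```python
# Rough size of a context_logits matrix: prompt_len x vocab_size floats.
prompt_len = 6          # tokens in the prompt above
vocab_size = 32000      # assumed; model-dependent
bytes_per_elem = 4      # float32

size_bytes = prompt_len * vocab_size * bytes_per_elem
print(size_bytes / 1024, "KiB")  # 750.0 KiB for this tiny prompt
```

For a few-thousand-token prompt the same matrix is hundreds of MiB, which is why the device-to-host copy dominates.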
The context_logits tensors are large, and llm.generate becomes very slow because it copies them from the GPU down to the CPU. How can I get the output with context_logits kept as GPU tensors?
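A generic mitigation sketch (not a TensorRT-LLM API, just the underlying idea in plain PyTorch): when you control the device-to-host copy yourself, slice on the GPU first so only the rows you actually need cross the bus. The stand-in tensor and the slicing choice here are illustrative assumptions:

```python
import torch

# Pick the GPU when available; the shapes below hold either way.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Stand-in for a context_logits matrix: [prompt_len, vocab_size].
context_logits = torch.randn(6, 32000, device=device)

# Copy only the last prompt token's logits to the host
# (1/6 of the data for this prompt, far less for long prompts).
last_row = context_logits[-1].to("cpu")
print(last_row.shape)  # torch.Size([32000])
```

This does not change what llm.generate itself returns; it only shows why avoiding the full-matrix host copy is the thing to optimize.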
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.
This issue was closed because it has been stalled for 15 days with no activity.