Medusa
Support batch size > 1
This PR is supposed to add support for batch size > 1 to the Medusa inference model. It is only a draft for now and needs further improvement.
Main change
Update `update_inference_inputs`, `tree_decoding`, and `generate_candidates` to support batch size > 1.
Use `[PAD]` tokens to pad the shorter prompts in the batch to a common length, for example:
```
prompt:
A B C [PAD] [PAD]
D E F H I
```
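For illustration, here is a minimal sketch of how such right-padding of a batch could be done; the helper name `pad_batch` and the explicit attention mask are assumptions made for this sketch, not code from this PR.

```python
import torch

def pad_batch(prompt_ids_list, pad_id):
    """Right-pad a list of 1-D token-id tensors to the length of the longest one."""
    max_len = max(len(ids) for ids in prompt_ids_list)
    input_ids = torch.full((len(prompt_ids_list), max_len), pad_id, dtype=torch.long)
    attention_mask = torch.zeros_like(input_ids)
    for i, ids in enumerate(prompt_ids_list):
        input_ids[i, : len(ids)] = ids     # real tokens on the left
        attention_mask[i, : len(ids)] = 1  # zeros cover the [PAD] positions
    return input_ids, attention_mask

# Matches the example above: "A B C" and "D E F H I" become a 2 x 5 batch,
# with the first row padded to "A B C [PAD] [PAD]".
```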
The plan is to squeeze out the `[PAD]` tokens that end up in the middle of the inputs when new tokens are appended to each sequence, for example:
```
prompt:             new_tokens:
A B C [PAD] [PAD]   X Y
D E F H I           Z [PAD]

new sequence:
A B C X Y [PAD]
D E F H I Z
```
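The squeeze-and-append step could look roughly like the sketch below; `squeeze_pads_and_append` is a hypothetical helper written for this description, assuming right-padded `input_ids` and a known pad id, not the exact code in this PR.

```python
import torch

PAD_ID = 0  # assumed pad token id for this sketch

def squeeze_pads_and_append(input_ids, new_tokens_list):
    """Drop [PAD] tokens from each row, append that row's accepted new tokens,
    then right-pad all rows back to a common length."""
    merged = []
    for row, new_tokens in zip(input_ids, new_tokens_list):
        real = row[row != PAD_ID]                     # squeeze out the pads
        merged.append(torch.cat([real, new_tokens]))  # append accepted tokens
    max_len = max(len(seq) for seq in merged)
    out = torch.full((len(merged), max_len), PAD_ID, dtype=input_ids.dtype)
    for i, seq in enumerate(merged):
        out[i, : len(seq)] = seq                      # right-pad to max_len
    return out

# Mirrors the diagram above:
# row 0: [A B C] + [X Y] -> [A B C X Y PAD]
# row 1: [D E F H I] + [Z] -> [D E F H I Z]
```

If `[PAD]` shares an id with a real vocabulary token, the actual implementation would need an explicit pad mask instead of matching on the id.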
Test:

```
python -m medusa.inference.inference_test --model 'FasterDecoding/medusa-vicuna-7b-v1.3'
```
This PR is marked as a draft, as more work is required to get it into a mergeable state.