How to pass a video to the processor?

Open ucaswindlike opened this issue 10 months ago • 3 comments

Mar 04 '25 03:03 ucaswindlike

@ucaswindlike,

You can simply add more image placeholders to your text prompt and pass the same number of images in the list given to the processor, so a video becomes a sequence of frames with one placeholder per frame. A simple dummy example looks like this:

# Assumes `processor` and `image` are already loaded.
convs = [
    {"role": "system", "content": "You are agent that can see, talk and act."},
    # One <image_start><image><image_end> placeholder per image passed below.
    {"role": "user", "content": "<image_start><image><image_end><image_start><image><image_end><image_start><image><image_end>\nWhat is the letter on the robot?"},
]
prompt = processor.tokenizer.apply_chat_template(convs, tokenize=False, add_generation_prompt=True)
# The number of images must match the number of placeholders in the prompt.
inputs = processor(images=[image]*3, texts=prompt, return_tensors="pt")
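
For an actual video, the same idea can be sketched as: pick a handful of frames spread across the clip, then build a prompt with one placeholder per frame. The helpers below (`sample_frame_indices`, `build_video_prompt`) are hypothetical names introduced for illustration and are not part of the repository. Decoding the video into frames is left out, and uniform sampling is only one possible strategy, so treat this as a minimal sketch under those assumptions.

```python
# Placeholder string copied from the example above; one is needed per frame.
IMAGE_PLACEHOLDER = "<image_start><image><image_end>"


def sample_frame_indices(total_frames: int, num_samples: int) -> list:
    """Pick up to num_samples frame indices spread evenly across the video."""
    if total_frames <= 0 or num_samples <= 0:
        return []
    num_samples = min(num_samples, total_frames)
    # Take the midpoint of each of num_samples equal-width segments.
    return [int((i + 0.5) * total_frames / num_samples) for i in range(num_samples)]


def build_video_prompt(num_frames: int, question: str) -> str:
    """Repeat the image placeholder once per frame, then append the question."""
    return IMAGE_PLACEHOLDER * num_frames + "\n" + question


if __name__ == "__main__":
    # Stand-in for decoded frames; in practice these would be images
    # decoded from the video by whatever video reader you use.
    all_frames = [f"frame_{i}" for i in range(100)]

    indices = sample_frame_indices(len(all_frames), 4)
    frames = [all_frames[i] for i in indices]
    user_text = build_video_prompt(len(frames), "What is the letter on the robot?")

    print(indices)
    print(user_text)

    # With a loaded `processor` (assumed, as in the example above):
    # convs = [
    #     {"role": "system", "content": "You are agent that can see, talk and act."},
    #     {"role": "user", "content": user_text},
    # ]
    # prompt = processor.tokenizer.apply_chat_template(
    #     convs, tokenize=False, add_generation_prompt=True)
    # inputs = processor(images=frames, texts=prompt, return_tensors="pt")
```

Keeping the number of sampled frames small limits prompt length, since every frame adds its own placeholder and image tokens.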

Mar 04 '25 04:03 jwyang

Also, @jwyang, will you be releasing a robotic action example? :)

Mar 05 '25 00:03 rr3087

@rr3087, actually, we already included a robot action example in the agents folder, based on the LIBERO environment. Please take a look! Ideally, we would also set up a Gradio demo for robot manipulation, but we have not yet figured out how to do that.

Mar 05 '25 06:03 jwyang