
Support audio in multimodal messages

Open rachwalk opened this issue 1 year ago • 6 comments

Is your feature request related to a problem? Please describe.

LLM APIs have started supporting audio input, so it would be beneficial for RAIMultimodalMessages to support audio as well.

Describe the solution you'd like

MultimodalMessage class (https://github.com/RobotecAI/rai/blob/5d3a8f33f20e6ccfebf4fecceb3ef7d2bc70d0d1/src/rai/rai/messages/multimodal.py#L38) should support audio input.

Describe alternatives you've considered

This is the only suitable solution within the current architecture.

Additional context

rachwalk avatar Jan 16 '25 15:01 rachwalk

From the issue, I understood that the changes are to be made in messages/multimodal.py, and that they are:

  1. delete the if self.audios not in [None, []]: check that was blocking audio support
  2. add support for base64 encoded audio files in the __init__ method
  3. create audio content entries similar to how images are handled, using an appropriate MIME type for audio (e.g. "audio/wav")
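
A minimal sketch of what steps 2 and 3 could look like, assuming the audio arrives as raw bytes. The helper names `encode_audio_base64` and `make_audio_content` are hypothetical, and the `"audio_url"` entry shape simply mirrors a data-URI image entry; it is an assumption, not RAI's code or a confirmed vendor schema, so the keys would need to match whatever each model provider actually accepts.

```python
import base64


def encode_audio_base64(audio_bytes: bytes) -> str:
    """Return the base64 text form of raw audio bytes."""
    return base64.b64encode(audio_bytes).decode("utf-8")


def make_audio_content(audio_b64: str, mime_type: str = "audio/wav") -> dict:
    """Build a content entry for base64 audio.

    Hypothetical shape: the "audio_url" keys are an assumption that mirrors
    a data-URI image entry, not a documented vendor format.
    """
    return {
        "type": "audio_url",
        "audio_url": {"url": f"data:{mime_type};base64,{audio_b64}"},
    }


# Usage: wrap some bytes (here just a placeholder payload) in a content entry.
entry = make_audio_content(encode_audio_base64(b"RIFF"))
```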

Should I create a pull request with these changes?

Please assign this issue to me. I'll work on it and create a PR. If I'm missing something, please let me know.

mdimado avatar Jan 17 '25 02:01 mdimado

Hi @mdimado, yes, please feel free to create a PR for this task! A fully completed implementation should include:

  1. A preprocess_audio function, similar to preprocess_image, to handle conversion of various audio formats (e.g., .mp3, .wav, np.array with sampling rate) into a standard format accepted by multimodal vendors.
  2. Validation to ensure the model can process and understand the provided audio content (e.g., compatibility with gpt-4o-audio-preview).
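
As a rough illustration of the two points above, here is a standard-library-only sketch. It assumes mono float samples in the range [-1.0, 1.0] plus a sampling rate as input (any iterable of floats, so a 1-D np.array would also work) and assumes base64-encoded 16-bit PCM WAV as the "standard format". The names `samples_to_wav_bytes`, `supports_audio`, and `AUDIO_CAPABLE_MODELS` are hypothetical, the allow-list contains only the model mentioned in this thread, and .mp3 decoding is left out because it needs a third-party decoder.

```python
import base64
import io
import struct
import wave
from typing import Iterable

# Hypothetical allow-list; only the model named in this thread is included.
AUDIO_CAPABLE_MODELS = {"gpt-4o-audio-preview"}


def samples_to_wav_bytes(samples: Iterable[float], sample_rate: int) -> bytes:
    """Encode mono float samples in [-1.0, 1.0] as 16-bit PCM WAV bytes."""
    pcm = bytearray()
    for sample in samples:
        clipped = max(-1.0, min(1.0, float(sample)))  # guard against overflow
        pcm += struct.pack("<h", int(round(clipped * 32767)))
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)       # mono
        wav_file.setsampwidth(2)       # 2 bytes per sample = 16-bit PCM
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(pcm))
    return buffer.getvalue()


def preprocess_audio(samples: Iterable[float], sample_rate: int) -> str:
    """Return base64-encoded WAV text for the given samples (sketch only)."""
    wav_bytes = samples_to_wav_bytes(samples, sample_rate)
    return base64.b64encode(wav_bytes).decode("utf-8")


def supports_audio(model_name: str) -> bool:
    """Check the model against the hypothetical allow-list above."""
    return model_name in AUDIO_CAPABLE_MODELS
```

A caller could check `supports_audio(model_name)` before attaching the output of `preprocess_audio(samples, 16000)` to a message, and raise a clear error otherwise.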

Let me know if you need any further clarification or assistance (here and/or on discord)

maciejmajek avatar Jan 17 '25 09:01 maciejmajek

Thanks for the clarification and additional details. After reviewing the task, I realize that implementing the preprocess_audio function and handling the validation might need more learning on my part. To ensure timely, high-quality work, I think someone with more expertise could handle this better. Apologies for the inconvenience; I kindly request to be unassigned for now.

mdimado avatar Jan 17 '25 11:01 mdimado

Hey @mdimado, no worries at all! We're all here to learn and grow together—that's what makes this such a great environment. 😊 Feel free to tackle any part of the work you're comfortable with, and don't hesitate to ask for guidance along the way. We’re always happy to help and support you through the process. Looking forward to it! 🚀

maciejmajek avatar Jan 17 '25 14:01 maciejmajek

@mdimado I have created sub-issues based on your task description (https://github.com/RobotecAI/rai/issues/373). Feel free to comment under it so I can assign you.

rachwalk avatar Jan 17 '25 14:01 rachwalk

Due to widespread lack of support for audio, we are postponing this feature.

maciejmajek avatar Apr 17 '25 15:04 maciejmajek