mlx-examples icon indicating copy to clipboard operation
mlx-examples copied to clipboard

mlx_whisper: add support for audio input from stdin

Open anthonywu opened this issue 1 year ago • 2 comments

Problem

I wanted to pipe an audio file to mlx_whisper, but found it only accepted file paths. This PR will allow mlx_whisper to accept stdin and pass it to ffmpeg accordingly then allow the rest of the workflow to go on as usual.

Changes

  1. load_audio helper adjusts ffmpeg flags based on file path vs. stdin mode
  2. CLI parser will gracefully omit the otherwise-required positional audio arg if stdin is determined to be active
  3. optionally, --input-name arg is supported to help users name the otherwise anonymous stdin content (cannot guess from file path)
  4. added tests in macOS standard zsh file to drive and test the changes from the CLI

Process

  1. ran black and pre-commit on changes prior to PR
  2. python test.py shows 4 errors, some regarding floating point comparisons. Looks very far away from my change, may be known issues.

anthonywu avatar Oct 03 '24 10:10 anthonywu

Thanks for the addition. What do you think about a couple modifications:

  • For piping from stdin use - as in mlx_whisper - . That is what we do in MLX LM so it is more consistent.
  • The argument --input-name is confusing to me. I understand it now but I think it will in general be confusing. It might be more clear to allow an optional --output-names argument with appropriate defaults (basename when available or output when not).

awni avatar Oct 03 '24 14:10 awni

  • For piping from stdin use - as in mlx_whisper - . That is what we do in MLX LM so it is more consistent.

Done. I agree self consistency between related projects is worth more than aesthetic preferences. This does have the nice effect of eliminating test cases.

The only tradeoff is users who reflexively think they can pipe anything into any tool's bare name will have to read the docs.

  • The argument --input-name is confusing to me. I understand it now but I think it will in general be confusing. It might be more clear to allow an optional --output-names argument with appropriate defaults (basename when available or output when not).

I've come around to --output-name and have proposed a "template" solution that preserves existing behavior, but also leaves room for future improvements such as fancy rename strategies based on transcribed audio content, or allows for power users to produce diff variations of output names when they use the same audio_path but use diff parameters.

anthonywu avatar Oct 04 '24 01:10 anthonywu