
Alignment data should be exposed as one of the outputs

Open shaunren opened this issue 2 years ago • 8 comments

This is useful to determine e.g. the word boundaries in the output waveform.

shaunren avatar May 11 '23 16:05 shaunren

I am currently working on this and found the following things:

  • The word boundaries are not directly obtainable, because each sentence is synthesized as a whole.
  • Synthesizing individual words and accumulating their lengths (including silence bytes) to get per-word alignment data is possible, but it takes much longer and is anything but accurate.
  • Synthesized sentences/words come out with a different length on each run.

My current implementation would output alignment data for sentences in CSV:

timestamp, word, start_index
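For illustration (the values are made up, just to show the schema), a few rows might look like:

```
timestamp,word,start_index
0.00,Hello,0
0.42,world,6
```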

orgarten avatar Dec 19 '23 06:12 orgarten

I found a solution to work around the problem:

  1. Send an HTTP request to a Piper Python web server to create a WAV file with a sample rate of 22050 Hz.
  2. Downsample the WAV to 16000 Hz using librosa.
  3. Post the wave data to the whisper.cpp web server for text recognition and alignment data.
  4. Convert the wave data to MP3 format and embed the alignment data in the ID3 lyrics tag.
  5. Output the MP3 file for the client.
  6. Client parses the alignment data from the ID3 tag.

It's annoying, but it works. On my Mac mini M4 the whole process takes about 400 ms. The text recognition isn't always accurate, but that doesn't matter: I only need the alignment data, and that is accurate enough because I already have the original text.
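For anyone who wants to try the same thing, here is a rough sketch of the pipeline in Python. It is not a verified implementation: the server URLs, ports, and request fields are assumptions about a locally running piper.http_server and whisper.cpp server, so adjust them to your own setup.

```python
# Rough sketch of the workaround pipeline described above (assumed endpoints).
import io

import librosa
import requests
import soundfile as sf
from mutagen.id3 import ID3, ID3NoHeaderError, USLT
from pydub import AudioSegment

PIPER_URL = "http://localhost:5000"              # assumed Piper HTTP server
WHISPER_URL = "http://localhost:8080/inference"  # assumed whisper.cpp server

def synthesize_with_alignment(text: str, out_mp3: str = "out.mp3"):
    # 1. Ask the Piper web server for a WAV (typically 22050 Hz).
    wav_bytes = requests.get(PIPER_URL, params={"text": text}).content

    # 2. Downsample to the 16 kHz that whisper.cpp expects.
    samples, sr = librosa.load(io.BytesIO(wav_bytes), sr=None)
    samples_16k = librosa.resample(samples, orig_sr=sr, target_sr=16000)
    wav_16k = io.BytesIO()
    sf.write(wav_16k, samples_16k, 16000, format="WAV")
    wav_16k.seek(0)

    # 3. Send the 16 kHz WAV to whisper.cpp for recognition and timestamps.
    response = requests.post(
        WHISPER_URL,
        files={"file": ("audio.wav", wav_16k, "audio/wav")},
        data={"response_format": "verbose_json"},
    )
    alignment = response.json()  # segments with timestamps

    # 4. Convert the original WAV to MP3 and embed the alignment in the
    #    ID3 lyrics (USLT) tag.
    AudioSegment.from_file(io.BytesIO(wav_bytes), format="wav").export(
        out_mp3, format="mp3"
    )
    try:
        tags = ID3(out_mp3)
    except ID3NoHeaderError:
        tags = ID3()
    tags.add(USLT(encoding=3, lang="eng", desc="alignment", text=str(alignment)))
    tags.save(out_mp3)

    # 5./6. Serve out_mp3 to the client, which parses the ID3 tag.
    return out_mp3
```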

luispater avatar Jan 07 '25 17:01 luispater

By the way, I read the Piper source code. The pipeline uses piper-phonemize, which calls espeak-ng to convert the text into phonemes and then maps those phonemes to ids. The ids are passed to the ONNX model, which acts as a black box: it takes phoneme ids as input and produces audio data as output. There is no way to get alignment data out of it...
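To make the black-box step concrete, here is a minimal sketch of running a voice model with onnxruntime. The input names below match the usual single-speaker Piper VITS export, but treat them as an assumption and check sess.get_inputs() on your own model; the phoneme ids are placeholders rather than real piper-phonemize output.

```python
# Minimal sketch of the ONNX black box (single-speaker voice assumed).
import numpy as np
import onnxruntime

sess = onnxruntime.InferenceSession("voice.onnx")

# Phoneme ids would normally come from piper-phonemize / espeak-ng;
# these are placeholder values.
phoneme_ids = np.array([[1, 20, 35, 5, 41, 2]], dtype=np.int64)

audio = sess.run(
    None,
    {
        "input": phoneme_ids,
        "input_lengths": np.array([phoneme_ids.shape[1]], dtype=np.int64),
        "scales": np.array([0.667, 1.0, 0.8], dtype=np.float32),  # noise, length, noise_w
    },
)[0]
# `audio` is just a waveform tensor; per-phoneme timing never leaves the model.
```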

luispater avatar Jan 07 '25 18:01 luispater

My assumption is that the models could be trained with time-stamped word-boundary data, and then they would know how to output alignment data, but I don't know enough about this yet.

eeejay avatar Jan 10 '25 18:01 eeejay

Alignment data is obtainable from the original PyTorch models, but not from the ONNX models currently. Exposing it would require re-exporting all of the voice models (which would be incompatible with existing Piper) as well as adjusting Piper's code.
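As a reference point for what such a re-export could expose: in a VITS-style PyTorch model the duration predictor yields per-phoneme frame counts (w_ceil), and converting those into timestamps is trivial. A sketch, assuming the usual Piper hop length and sample rate:

```python
import numpy as np

def durations_to_timestamps(frame_durations, hop_length=256, sample_rate=22050):
    """frame_durations: per-phoneme mel-frame counts (e.g. w_ceil from VITS)."""
    ends = np.cumsum(np.asarray(frame_durations, dtype=float)) * hop_length / sample_rate
    starts = np.concatenate(([0.0], ends[:-1]))
    return list(zip(starts, ends))  # (start_sec, end_sec) per phoneme
```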

synesthesiam avatar Jan 11 '25 02:01 synesthesiam

We created a workaround for rough alignment data in #407

orgarten avatar Jan 12 '25 10:01 orgarten

I created a straightforward approximate alignment for the audio. I developed a set of relative timing coefficients, where consonants are short and stressed vowels are long, and then stretched these per-phoneme coefficients so that their sum matches the length of the synthesized audio. Thankfully, this method works well even without an extra recognition step.
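A minimal sketch of that idea (not the exact OpenVoiceOS implementation, and the weights below are illustrative only):

```python
def approximate_alignment(phonemes, audio_duration_s, weights=None):
    """Distribute the audio duration over phonemes proportionally to relative weights."""
    # Consonants short, stressed vowels long; illustrative values only.
    default_weights = {"t": 0.5, "k": 0.5, "s": 0.7, "n": 0.6, "@": 0.8, "a": 1.0, "'a": 1.5}
    weights = weights or default_weights

    rel = [weights.get(p, 1.0) for p in phonemes]
    scale = audio_duration_s / sum(rel)

    spans, t = [], 0.0
    for phoneme, r in zip(phonemes, rel):
        spans.append((phoneme, t, t + r * scale))
        t += r * scale
    return spans  # (phoneme, start_sec, end_sec)
```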

In fact, aligning the Whisper-recognized text (which included timestamps) to the phonemes was quite a challenge. It involved several steps: first, matching words with position penalties to split the word sequence, and then breaking it down by sentence endings. After that, there was a less confident phase followed by DTW alignment. Surprisingly, the simple algorithm using the coefficients produced results that were almost as good.
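For completeness, here is a textbook DTW sketch of the word-to-word part of such an alignment, using character similarity as the cost. This is generic code, not the exact pipeline described above:

```python
import difflib

def dtw_align(recognized, expected):
    """Return (i, j) pairs aligning recognized[i] to expected[j]."""
    cost = lambda a, b: 1.0 - difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()
    n, m = len(recognized), len(expected)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = cost(recognized[i - 1], expected[j - 1]) + min(
                D[i - 1][j], D[i][j - 1], D[i - 1][j - 1]
            )
    # Backtrack the cheapest path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, (i, j) = min(
            (D[i - 1][j - 1], (i - 1, j - 1)),
            (D[i - 1][j], (i - 1, j)),
            (D[i][j - 1], (i, j - 1)),
        )
    return path[::-1]
```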

Phoneme relative durations: https://github.com/OpenVoiceOS/ovos-classifiers/blob/dev/ovos_classifiers/heuristics/phonemizer.py

This repo's issues:

  • #531
  • #407
  • #364
  • #425

Boorj avatar Jan 17 '25 11:01 Boorj

Thank you, @Boorj, for sharing your approach, really clever and pragmatic!

If I understand correctly, your method is entirely standalone and doesn’t require modifying Piper, right? You generate approximate phoneme durations by using pre-defined timing coefficients (longer for vowels, shorter for consonants), then stretch those durations proportionally to match the synthesized audio length.

That makes a lot of sense, especially when it's accurate enough for basic word/phoneme highlighting. I imagine this should generalize to other languages as well, provided there's a good phonemizer and a reasonable duration mapping.

I have a few follow-ups: How did you determine the avg_durs values? Are they based on empirical data (e.g., forced alignment) or just heuristics, or could espeak-ng potentially help in this step? Is your phonemizer tied to OpenVoiceOS, or could it be used as a drop-in Python module with standard Piper workflows?

I'm thinking I could do something similar using espeak-ng (for phonemes) and a simple JS or Python script to generate timestamped spans from audio + text, without touching Piper itself. Your approach seems like the fastest way to get decent alignment with minimal complexity.

Thanks again for sharing this, really valuable!

isolveit-aps avatar Aug 06 '25 10:08 isolveit-aps