Voices

Fast in-process text to speech for Java 17 and above. No external apis. No system dependencies.

What is this?

An easy-to-use local text to speech library for Java.

It can produce reasonable quality audio using low-specced hardware.

It provides several components

Code to run the voice models from the piper and kokoro projects
A compatible pure Java phonemizer for English partially ported from phonemize
Compatible phoneme dictionaries for uk and us English
A multi-lingual phonemizer using the onnx model from OpenVoiceOs
A small number of models available as dependencies on maven central
Code to download other models not uploaded to central

The models are run using the onnxruntime library, so can utilise both CPU and GPU.

Which Model Should I Use?

The piper models are fast and very lightweight. The Kokoro models arguably produce better quality speech, but run approximately 4x slower but can be accelerated with a GPU. There seems to be little benefit using a GPU with the piper models.

Ultimately, which voice sounds better is a matter of personal taste.

Releases

See Releases

English-Only Usage With Piper & Rules Based Phonemizer

Using Voices with pipper requires three code dependencies and one or more models.

<!-- main dependency -->
<dependency>
    <groupId>org.pitest.voices</groupId>
    <artifactId>chorus</artifactId>
    <version>0.0.9</version>
</dependency>
<!-- a prepackaged model -->
<dependency>
    <groupId>org.pitest.voices</groupId>
    <artifactId>alba</artifactId>
    <version>0.0.9</version>
</dependency>
<!-- dictionary of pronunciations -->
<dependency>
    <groupId>org.pitest.voices</groupId>
    <artifactId>en_uk</artifactId> <!-- or en_us -->
    <version>0.0.9</version>
</dependency>
<!-- runtime for onnx models -->
<dependency>
    <groupId>com.microsoft.onnxruntime</groupId>
    <artifactId>onnxruntime</artifactId> <!-- or onnxruntime_gpu -->
    <version>1.22.0</version>
</dependency>

Technically, the rules based phonemizer can be used without a dictionary, but the quality of the speech would be poor.

The Chorus class acts as a manager for voice models, handling loading and freeing of resources. Loading is an expensive operation, so it is recommended to keep a single instance of Chorus for the lifetime of your application.

ChorusConfig config = chorusConfig(EnUkDictionary.en_uk());
try (Chorus chorus = new Chorus(config)) {
    Voice alba = chorus.voice(Alba.albaMedium());

    Audio audio = alba.say("Hello there, I'm vaguely Scottish.");
    audio.save(some path);
}

The example above uses a model retrieved at build time as a normal maven dependency.

A wider range of models can be retrieved at runtime by adding the model downloader dependency.

<dependency>
    <groupId>org.pitest.voices</groupId>
    <artifactId>model-downloader</artifactId>
    <version>0.0.9</version>
</dependency>

Models can be retrieved using the factory methods on the

org.pitest.voices.download.Models
org.pitest.voices.download.UsModels
org.pitest.voices.download.NonEnglishModels

Classes.

By default, voice models are downloaded to ~/.cache/voices/, but this can be configured in ChorusConfig.

Multi-lingual Usage

The OpenVoice phonemizer is much more capable than the rules-based one. It can be used without a dictionary to create good quality speech in multiple languages (including English).

It is more heavy-weight, using a 50mb model (compared to 3mb for a dictionary file), and is more computationally expensive.

Once the dependency has been added

<dependency>
    <groupId>org.pitest.voices</groupId>
    <artifactId>openvoice-phonemizer</artifactId>
    <version>0.0.9</version>
</dependency>

The phonemizer can be selected with

ChorusConfig config = chorusConfig(Dictionaries.empty())
        .withModel(new OpenVoiceSupplier());

Using Kokoro Models

The kokoro-runtime dependency provides the kokoro model and 11 voices. Usage and dictionary/phonmiser selection is then the same as for piper models,

<dependency>
    <groupId>org.pitest.voices</groupId>
    <artifactId>openvoice-phonemizer</artifactId>
    <version>0.0.9</version>
</dependency>

ChorusConfig config = chorusConfig(EnUkDictionary.en_uk());
try (Chorus chorus = new Chorus(config)) {
     Voice v1 = chorus.voice(KokoroModels.afSarah())
        .withSpeed(1.1f);

    Audio audio = alba.say("Kokoro also works!");
    audio.save(some path);
}

Running on GPU

Models can be run on GPU instead of CPU by using the onnxruntime_gpu dependency instead of onnxruntime. It is important that only the onnxruntime_gpu dependency is on the classpath. If the standard onnxruntime is also present the model will fail to load to gpu.

To activate the gpu, the gpuChorusConfig can be used.

ChorusConfig config = gpuChorusConfig(EnUkDictionary.en_uk());

This runs the model on gpu 0 with no other options set. More complex setups can be configured using the withCudaOptions method on ChorusConfig.

There seems to be little benefit using a gpu with the piper models, but inferences can be much faster for kokoro models.

Pauses

Voices will add pauses if it encounters the following markdown symbols

Markdown # Style Headings
Markdown --- Section breaks
Em dashes and Markdown --- em dashes
En dashes and Markdown -- en dashes

The defaults can be adjusted via the ChorusConfig class.

Hetronyms

Voice's hetronym (words with the same spelling but different pronunciations) handling can sometimes perform better than piper and espeak-ng thanks to part of speech tagging provided by the OpenNLP library.

Phrases such as

I moped on my moped.
I rebel because I am a rebel.
Sow the seeds for the sow to eat.

All use the correct pronunciations of the heterographs.

Licencing

Most of the project is licenced under Apache 2. The en_uk dictionary is released under GPL 3 due to a cautious interpretation of the licencing terms of the espeak-ng tool which was used to generate much of its content.

Although generally the GPL does not apply to the output of a program, it seems probable that feeding a word list to espeak-ng will result in it regurgitating a significant proportion of its own internal dictionary.

The en_us dictionary is of lower quality, but is generated by transforming the CMU dictionary which, whilst copyrighted by Carnegie Mellon University, is free to use so long as its copyright is acknowledged.

The models from the piper project are not part of this project and may have their own usage restrictions. Please check they match your use case.

Alternatives

Sherpa Onnx

The Sherpa-onnx project can also run piper models.

At the point this project was initiated, sherpa was difficult to consume as it was not available from maven central and required manual installation of native libraries. It also seemed to handle hetronyms poorly.

This situation may have since improved.

Mary TTS

Mary TTS is very mature and produces reasonable quality speech, however it sounds a little robotic by modern standards.

Development

Although much of the ported logic is not well tested, there are a splattering of tests to prevent major regression while changing things, and a few tests that by default play audio to allow experimentation.

If you're building from the command line, the audio can be disabled with.

mvn -Dsilent=true install

Background

I created this library to narrate my own writing as part of my editing loop. Initially it called the sherpa native libraries, but I kept coming back to the idea of writing a pure Java phonemizer as it would allow some degree of control over pauses, which are important for narrating fiction.

I have no background in text to speech or linguistics, so much of the functionality relies on work by other better qualified people.

voices
voices copied to clipboard

Metadata

Voices

What is this?

Which Model Should I Use?

Releases

English-Only Usage With Piper & Rules Based Phonemizer

Multi-lingual Usage

Using Kokoro Models

Running on GPU

Pauses

Hetronyms

Licencing

Alternatives

Sherpa Onnx

Mary TTS

Development

Background

← Metadata

Owner

Metadata

voices voices copied to clipboard

Metadata

Voices

What is this?

Which Model Should I Use?

Releases

English-Only Usage With Piper & Rules Based Phonemizer

Multi-lingual Usage

Using Kokoro Models

Running on GPU

Pauses

Hetronyms

Licencing

Alternatives

Sherpa Onnx

Mary TTS

Development

Background

← Metadata

Owner

Metadata

voices
voices copied to clipboard