
Calling melo CLI for "ZH" long coldstart times, even if cached

zihaolam opened this issue 1 year ago · 0 comments

Running this command produces consistent output in approximately 7 seconds: melo 我的名字叫小杨 dog.wav --language ZH

/Users/zihaolam/Projects/tts-editor/MeloTTS/melo/main.py:71: UserWarning: You specified a speaker but the language is English.
  warnings.warn("You specified a speaker but the language is English.")
loading pickled model from cache
loaded pickled model from cache, took 8.529947996139526
 > Text split to sentences.
我的名字叫小杨
 > ===========================
  0%|                                                                  | 0/1 [00:00<?, ?it/s]Building prefix dict from the default dictionary ...
Loading model from cache /var/folders/j4/zkddp3ms6493qzbf3qf7rfwr0000gn/T/jieba.cache
Loading model cost 0.406 seconds.
Prefix dict has been built successfully.
Some weights of the model checkpoint at bert-base-multilingual-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
/Users/zihaolam/Projects/tts-editor/MeloTTS/.venv/lib/python3.9/site-packages/torch/nn/functional.py:4522: UserWarning: MPS: The constant padding of more than 3 dimensions is not currently supported natively. It uses View Ops default implementation to run. This may have performance implications. (Triggered internally at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/native/mps/operations/Pad.mm:472.)
  return torch._C._nn.pad(input, pad, mode, value)
/Users/zihaolam/Projects/tts-editor/MeloTTS/melo/commons.py:123: UserWarning: MPS: no support for int64 for min_max, downcasting to a smaller data type (int32/float32). Native support for int64 has been added in macOS 13.3. (Triggered internally at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/native/mps/operations/ReduceOps.mm:612.)
  max_length = length.max()
100%|██████████████████████████████████████████████████████████| 1/1 [00:07<00:00,  7.51s/it]
import os
import pickle
import time


def get_model_pkl_path(language: str):
    return os.path.join(os.path.dirname(__file__), f"model_{language}.pkl")


def get_model(language: str, device: str):
    model_pkl_path = get_model_pkl_path(language)
    if not os.path.exists(model_pkl_path):
        from melo.api import TTS

        model = TTS(language=language, device=device)
        with open(model_pkl_path, "wb") as f:
            pickle.dump(model, f)
    else:
        with open(model_pkl_path, "rb") as f:
            start = time.time()
            print("loading pickled model from cache")
            model = pickle.load(f)
            print(f"loaded pickled model from cache, took {time.time() - start:.2f}s")
    return model

Pickling the TTS model still does not help: synthesizing a short sentence takes approximately 7 seconds.
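One way to see where the ~7 seconds actually goes is to time the heavy imports separately: unpickling the model in a fresh process still triggers importing its dependencies, so the pickle cache cannot save that part. A rough sketch (the module names below are assumptions about what melo pulls in; adjust as needed):

```python
import importlib
import time


def timed_import(name):
    """Return the seconds spent importing `name`, or None if missing.
    Modules already in sys.modules return almost instantly, so run
    this in a fresh interpreter to measure true cold-start cost."""
    start = time.time()
    try:
        importlib.import_module(name)
        return time.time() - start
    except ImportError:
        return None


# Assumed heavy dependencies of MeloTTS; swap in whatever melo imports.
for name in ("torch", "transformers"):
    elapsed = timed_import(name)
    if elapsed is None:
        print(f"{name}: not installed")
    else:
        print(f"import {name}: {elapsed:.2f}s")
```

If most of the 7 seconds shows up here, no amount of model-level caching will help a one-shot CLI process.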

Is there a way to improve the speed, or to cache anything further to reduce this cold start?

The Gradio web UI takes approximately 1 second to generate the same text. However, I would like to use the CLI instead of running a Python server. Is there a way to optimise things so that the CLI takes the same time as the web UI/server?
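The gap against the web UI is plausibly just per-process initialization (imports, BERT weights), which a long-lived server pays once. A minimal sketch of getting server-like latency without Gradio is a warm loop that loads the model once and then reads text from stdin; the `synthesize` callable here is a placeholder for something like `model.tts_to_file`, and none of this is MeloTTS API:

```python
import sys
import time


def serve(synthesize, lines=None):
    """Read text lines (stdin by default) and hand each to an
    already-loaded synthesizer, so startup cost is paid only once."""
    for line in (lines if lines is not None else sys.stdin):
        text = line.strip()
        if not text:
            continue
        start = time.time()
        synthesize(text)
        print(f"done in {time.time() - start:.2f}s", flush=True)


if __name__ == "__main__":
    # Hypothetical: load the model once up front, e.g.
    #   from melo.api import TTS
    #   model = TTS(language="ZH", device="auto")
    #   serve(lambda text: model.tts_to_file(text, 0, "out.wav"))
    serve(lambda text: None)  # placeholder synthesizer
```

A shell script could then pipe sentences to this one warm process instead of paying the cold start per invocation.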

zihaolam · May 11 '24 00:05