Synthesized speech is gibberish

Open PressEsync opened this issue 3 months ago • 3 comments

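With CosyVoice2-0.5B, I tried batch synthesis using one fixed text and reference audio from different speakers, in order to measure how often the synthesized speech defeats a speaker verification system. However, almost all of the synthesized audio is gibberish: it does not follow the given text, and the speaking rate and audio length vary from clip to clip. Occasionally a clip does follow the given text, but only half of it gets synthesized. What is going on here?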

PressEsync avatar Nov 07 '25 12:11 PressEsync

Could you post your code?

ScottishFold007 avatar Nov 11 '25 02:11 ScottishFold007

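The gibberish is gone: my reference audio and reference transcripts were mismatched, and that is now fixed. However, the synthesized timbre is still off. Female voices come out as male, and some clips are noticeably slow. Did I get something wrong? Every reference clip I pass in is 6 s long, with an original sample rate of 16 kHz.

Code:

import os

import torchaudio
from tqdm import tqdm
from cosyvoice.cli.cosyvoice import CosyVoice2

# GlobalConfig and resample_to_target are my own helpers (not shown here).

def df_cosyvoice(speaker_wav_path="./audio_clean", output_dir="./output_clean_cosyvoice", device="cuda", logger=None):
    config = GlobalConfig()
    TEXT_TO_SPEAK = config.text_cosy  # the fixed text to synthesize

    cosyvoice = CosyVoice2(
        './CosyVoice/pretrained_models/CosyVoice2-0.5B',
        load_jit=False,
        load_trt=False,
        load_vllm=False,
        fp16=False
    )

    wav_files = [f for f in os.listdir(speaker_wav_path) if f.lower().endswith(".wav")]

    logger.info(f"Found {len(wav_files)} audio files, starting batch generation...")

    for wav_file in tqdm(wav_files, desc="Generating speech", ncols=100):
        speaker_file = os.path.join(speaker_wav_path, wav_file)
        file_name_noext = os.path.splitext(wav_file)[0]
        output_file = os.path.join(output_dir, f"{file_name_noext}_cosyvoice.wav")

        # The reference transcript sits next to the audio as <name>.txt
        text_file = os.path.join(speaker_wav_path, f"{file_name_noext}.txt")
        if os.path.exists(text_file):
            with open(text_file, "r", encoding="utf-8") as f:
                text_to_speak = f.read().strip()
        else:
            logger.warning(f"Transcript file not found: {text_file}, using an empty string")
            text_to_speak = ""

        # Resample the reference clip to the model's output sample rate
        waveform, sr = resample_to_target(speaker_file, target_sr=cosyvoice.sample_rate)
        prompt_speech = waveform

        # Arguments: text to synthesize, reference transcript, reference audio
        for i, j in enumerate(cosyvoice.inference_zero_shot(TEXT_TO_SPEAK, text_to_speak, prompt_speech, stream=False, text_frontend=True)):
            torchaudio.save(output_file, j['tts_speech'], cosyvoice.sample_rate)

        logger.info(f"Generated speech: {output_file}, reference audio: {speaker_file}")

    logger.info("Batch speech generation finished.")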
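
One detail in the loop above may relate to the earlier "only half synthesized" symptom: every result yielded by `inference_zero_shot` is written to the same `output_file`, so if the generator yields more than one segment for a long text, only the last segment survives on disk. Whether it actually yields several segments for this particular text is not confirmed in the thread. The sketch below illustrates the difference with plain Python lists; `fake_inference`, `keep_last_only`, and `collect_all` are hypothetical stand-ins, not part of the CosyVoice API.

```python
from typing import Dict, Iterable, Iterator, List


def fake_inference(segments: List[List[float]]) -> Iterator[Dict[str, List[float]]]:
    """Hypothetical stand-in for a generator that yields one result per text segment."""
    for samples in segments:
        yield {"tts_speech": samples}


def keep_last_only(results: Iterable[Dict[str, List[float]]]) -> List[float]:
    """Mimics saving every result to the same path: each save replaces the previous one."""
    saved: List[float] = []
    for result in results:
        saved = result["tts_speech"]
    return saved


def collect_all(results: Iterable[Dict[str, List[float]]]) -> List[float]:
    """Gathers every segment in order so the whole utterance can be saved once."""
    merged: List[float] = []
    for result in results:
        merged.extend(result["tts_speech"])
    return merged


# Three made-up segments of different lengths.
segments = [[0.1, 0.2], [0.3], [0.4, 0.5, 0.6]]
overwritten = keep_last_only(fake_inference(segments))
merged = collect_all(fake_inference(segments))
print(f"kept when overwriting: {len(overwritten)} samples, kept when merging: {len(merged)} samples")
```

With real tensors, the analogous step would be to gather each `j['tts_speech']` in a list and join them with `torch.cat(chunks, dim=1)` before a single `torchaudio.save`, assuming each chunk has shape (channels, samples).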

PressEsync avatar Nov 11 '25 11:11 PressEsync
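
A possible explanation for the timbre and speed symptoms, offered as a hypothesis rather than a confirmed diagnosis: the posted code resamples the reference audio to `cosyvoice.sample_rate` before passing it to `inference_zero_shot`. If the installed CosyVoice version expects that argument at 16 kHz (in some versions of the repository the parameter is named `prompt_speech_16k`; the thread does not say which version is in use) and `cosyvoice.sample_rate` is 24000 Hz, the prompt would be read 1.5 times too slowly and at two thirds of its pitch, which would make a female voice sound lower and slower. The pure-Python sketch below only demonstrates that arithmetic on a synthetic tone; the 220 Hz tone and both rates are illustrative assumptions.

```python
import math
from typing import List


def make_tone(freq_hz: float, sample_rate: int, seconds: float) -> List[float]:
    """Generate a sine tone sampled at sample_rate."""
    n = int(sample_rate * seconds)
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(n)]


def estimate_pitch(samples: List[float], assumed_rate: int) -> float:
    """Estimate frequency from rising zero crossings, using the rate the consumer assumes."""
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if a < 0 <= b)
    return crossings / (len(samples) / assumed_rate)


ACTUAL_RATE = 24000   # rate the samples were really produced at (assumed value)
ASSUMED_RATE = 16000  # rate the consumer believes the samples have (assumed value)

tone = make_tone(220.0, ACTUAL_RATE, 6.0)

true_duration = len(tone) / ACTUAL_RATE
misread_duration = len(tone) / ASSUMED_RATE
true_pitch = estimate_pitch(tone, ACTUAL_RATE)
misread_pitch = estimate_pitch(tone, ASSUMED_RATE)

print(f"duration: {true_duration:.1f} s read correctly, {misread_duration:.1f} s when misread")
print(f"pitch: about {true_pitch:.0f} Hz read correctly, about {misread_pitch:.0f} Hz when misread")
```

If this is the cause, loading the reference clips at 16 kHz (their original rate) and passing them through unchanged would be the thing to try, while still saving the output at `cosyvoice.sample_rate`.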

A transformers version that is too new may also cause this issue: the output sounds like "wuluwulu" and nothing is intelligible.

dulingkang avatar Dec 09 '25 09:12 dulingkang
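
To follow up on the version suggestion, a quick check of what is installed can be sketched as follows. It uses only the standard library; `installed_version`, `release_tuple`, and `is_newer_than` are hypothetical helper names, and the pinned version to compare against has to come from the requirements file of the CosyVoice checkout in use, since the thread does not state it.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional, Tuple


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def release_tuple(version_string: str) -> Tuple[int, ...]:
    """Rough parse of the leading numeric parts, stopping at the first non-numeric one."""
    parts = []
    for piece in version_string.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def is_newer_than(installed: str, pinned: str) -> bool:
    """True if the installed release is strictly newer than the pinned one."""
    return release_tuple(installed) > release_tuple(pinned)


found = installed_version("transformers")
print("transformers:", found if found is not None else "not installed")
```

Calling `is_newer_than(found, pinned)` with the pin copied from the requirements file then shows whether the environment has drifted ahead of what the project was tested with.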