
We're actively maintaining and modernizing this classic codebase! 🔧 – Now Easier to Use in MoChaBench


Huge thanks to the original authors of SyncNet for this awesome codebase!

We use SyncNet as a standard evaluation metric for Talking Character Generation models.

To support broader adoption and easier integration into modern pipelines, we're actively maintaining and modernizing this classic implementation.

👉 Check out the updated version here: MoChaBench

If you find our work helpful, please consider citing both the original SyncNet and MoCha in your research. Your support means a lot!

The implementation follows a Hugging Face Diffusers-style structure. We provide a SyncNetPipeline class, located at eval-lipsync/script/syncnet_pipeline.py.

You can initialize SyncNetPipeline by providing the weights and configs:

pipe = SyncNetPipeline(
    {
        "s3fd_weights": "path to sfd_face.pth",         # S3FD face detector checkpoint
        "syncnet_weights": "path to syncnet_v2.model",  # pretrained SyncNet checkpoint
    },
    device="cuda",          # or "cpu"
)

The pipeline offers an inference function to score a single video/speech pair. For a fair comparison, the input speech should be a denoised vocal track extracted from your audio. You can use separators such as Kim_Vocal_2 for general noise removal and Demucs_mdx_extra for music removal.

av_off, sync_confs, sync_dists, best_conf, min_dist, s3fd_json, has_face = pipe.inference(
    video_path="path to video.mp4",    # RGB video
    audio_path="path to speech.wav",   # speech track (denoised, in an ffmpeg-readable format)
    cache_dir="path to store intermediate output",   # optional; omit to auto-clean intermediate files
)
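
As a minimal usage sketch of the returned values (variable names follow the call above; the interpretation comments are general guidance, not thresholds defined by the pipeline):

if not has_face:
    # No face was detected, so the sync metrics are not meaningful for this clip.
    print("No face detected in the video.")
else:
    print(f"AV offset (frames): {av_off}")
    print(f"Best sync confidence: {best_conf:.3f}")   # higher confidence generally indicates better lip-sync
    print(f"Minimum sync distance: {min_dist:.3f}")   # lower distance generally indicates better lip-sync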
