DiffTalk: RAM usage keeps increasing during one epoch.

After preprocessing the HDTF dataset, I got 415 videos. 249 videos (60%) were randomly selected as the training set; the others (40%) formed the test set. The first 1500 frames of each video were extracted for training with stride 2. This gave 277,117 frames in the training set and 179,711 frames in the test set.

My machine has 4 A100 GPUs with 40GB VRAM, 377GB RAM, and 72GB swap. In training, the batch size is set to 16. During the first epoch, RAM usage keeps increasing. At step 2743, all RAM (and even the swap space) was occupied and training stopped. Thus, 2743 * 16 * 4 = 175,552 is the maximum number of frames that can be used for training on my machine, and that is without taking the test set into account. When I reduced both the training and test sets to 10,000 frames, training ran fine.

Questions @sstzal :

Did you meet the same problem in your training?
If so, how did you solve the problem?
Is it possible to release the weights of diffusion model?

My guess is that the cause of this problem is too much logging during training.
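A minimal sketch for checking a guess like this, assuming the growth comes from Python objects piling up in the training process. It uses the standard-library tracemalloc module to compare two snapshots and list the source lines whose allocations grew the most. `simulate_training` and `log_history` are hypothetical stand-ins for a loop that keeps every log entry; they are not DiffTalk code.

```python
import tracemalloc


def top_growth(before, after, limit=5):
    """Return (location, size_diff_in_bytes) for the lines that changed the most."""
    stats = after.compare_to(before, 'lineno')
    return [(str(stat.traceback), stat.size_diff) for stat in stats[:limit]]


def simulate_training(num_steps, log_history):
    # Hypothetical stand-in for a training loop that never drops old log entries.
    for step in range(num_steps):
        log_history.append({'step': step, 'loss': [0.0] * 100})


if __name__ == '__main__':
    tracemalloc.start()
    snapshot_before = tracemalloc.take_snapshot()

    history = []
    simulate_training(1000, history)

    snapshot_after = tracemalloc.take_snapshot()
    for location, size_diff in top_growth(snapshot_before, snapshot_after):
        print(f'{size_diff / 1024:10.1f} KiB  {location}')
    tracemalloc.stop()
```

In a real run the two snapshots would be taken a few hundred training steps apart. Note that tracemalloc only sees memory allocated through Python, so growth inside native libraries or in data-loading worker processes would not show up here.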

Aug 10 '23 01:08 quqixun

Hello, may I ask how you extracted the audio features?

Aug 16 '23 11:08 rjc7011855

@rjc7011855 Try https://github.com/YudongGuo/AD-NeRF/tree/master/data_util/deepspeech_features and try to make speech features in shape [8, 16, 29].

Aug 17 '23 01:08 quqixun

Hello, how do you get the landmarks, please?

Aug 17 '23 02:08 xz0305

@xz0305 It is quite simple to get 68 landmarks using dlib. http://dlib.net/files/shape_predictor_68_face_landmarks.dat.bz2

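import cv2
import dlib
import numpy as np


class LandmarksExtractor(object):
    """Detects a face with dlib and returns its 68 landmarks."""

    def __init__(self, model_path):
        # model_path: path to the unpacked shape_predictor_68_face_landmarks.dat
        self.detector = dlib.get_frontal_face_detector()
        self.predictor = dlib.shape_predictor(model_path)

    def forward(self, image, is_rgb=True):
        # dlib expects RGB input, while OpenCV loads images as BGR.
        if not is_rgb:
            image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        landmarks = self.__predict(image)

        return landmarks

    def __predict(self, image):
        # The second argument upsamples the image once to help find smaller faces.
        faces = self.detector(image, 1)
        # Raise instead of assert so the check survives `python -O`.
        if len(faces) == 0:
            raise ValueError('no face detected in image')
        # Only the first detected face is used.
        face = faces[0]
        landmarks = self.predictor(image, face)
        landmarks = self.shape_to_np(landmarks)
        return landmarks

    @staticmethod
    def shape_to_np(shape, dtype=int):
        # Convert the dlib result into a (68, 2) array of (x, y) coordinates.
        coords = np.zeros((68, 2), dtype=dtype)
        for i in range(68):
            coords[i] = (shape.part(i).x, shape.part(i).y)
        return coords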

Aug 17 '23 03:08 quqixun

Thank you very much.

Aug 17 '23 03:08 xz0305

Thank you very much.

Aug 18 '23 06:08 rjc7011855

Hello, how well did your reproduction turn out? Could we discuss it?

Aug 18 '23 11:08 979277

@979277

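I trained for some epochs; here are some results: difftalk_demo.zip

After preprocessing I have 400+ video clips in total. The authors only provided the video names for the training set, not for the test set, so I simply split the dataset at random.

Because memory usage keeps growing during training (see the issue description at the top), after several experiments I ended up using the first 1100 frames of each video (taking every other frame) for training and testing. The videos in difftalk_demo.zip come from the test set and were generated from the first 720 consecutive frames. As you can see, the method does work to some extent.

For the next experiments I plan to reduce the number of videos and use all frames of each video. I will use source videos from which multiple clips can be cut, keep one clip as the test set and the others as the training set, then retrain and see how it performs.

There is no validation set during training, only a test set, and the final results were also observed on the test set, so there may be a risk of data leakage. The authors probably did it the same way.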

Aug 21 '23 03:08 quqixun

How did you download the HDTF data? The videos I downloaded have no sound.

Aug 21 '23 07:08 xz0305

@xz0305 You can use youtube-dl or yt-dlp to download the videos with the best quality in both the video and audio streams.
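A minimal sketch of what that could look like through yt-dlp's Python interface, assuming yt-dlp and ffmpeg are installed. The helper names `build_ydl_opts` and `download_videos`, the output directory, and the mp4 container are illustrative choices, not something taken from the DiffTalk or HDTF repositories.

```python
import os


def build_ydl_opts(out_dir):
    """Options asking for the best video and best audio streams, merged into one file."""
    return {
        # Best video plus best audio; fall back to the best single file.
        'format': 'bestvideo+bestaudio/best',
        # Merging separate streams needs ffmpeg on the PATH.
        'merge_output_format': 'mp4',
        # Name each file after its video id.
        'outtmpl': os.path.join(out_dir, '%(id)s.%(ext)s'),
    }


def download_videos(urls, out_dir):
    # Imported here so the options helper works without yt-dlp installed.
    import yt_dlp  # third-party: pip install yt-dlp

    os.makedirs(out_dir, exist_ok=True)
    with yt_dlp.YoutubeDL(build_ydl_opts(out_dir)) as ydl:
        ydl.download(urls)
```

Calling `download_videos([...], 'hdtf_raw')` with the video URLs listed by the HDTF dataset should then give files that contain an audio track, provided the format selection finds one.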

Aug 22 '23 01:08 quqixun

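@quqixun Hello, does this step save the audio of each frame as an .npy file? When I do it this way, the generated features all have length 0. Could you explain the exact procedure?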

Aug 22 '23 06:08 xz0305

@xz0305 What gets saved are the audio features, extracted with https://github.com/YudongGuo/AD-NeRF/tree/master/data_util/deepspeech_features

Aug 22 '23 09:08 quqixun

Thank you for sharing this link. If the video contains 3000 frames, using this repo for audio feature extraction returns one .npy file with shape (3000, 16, 29). However, the DiffTalk model needs a separate .npy file for each frame. Could you please share how we can do this? Thanks.

Aug 22 '23 11:08 Tinaa23

@Tinaa23

Turn (3000, 16, 29) into (3000, 8, 16, 29):

- 3000: number of frames
- 8: sequence length for each frame
- 16: window size
- 29: number of features

See https://github.com/sstzal/DiffTalk/issues/10#issuecomment-1641661343 .

Or you can refer to the code at https://github.com/miu200521358/NeuralVoicePuppetryMMD/blob/master/Audio2ExpressionNet/Training%20Code/data/audio_dataset.py#L85 , which shows two ways to generate the sequence.
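One way to build such sequences can be sketched as follows. This is an assumption about the windowing, not necessarily what either linked implementation does: each frame gets the 8 consecutive feature windows around it (4 before, the frame itself, 3 after), and indices that fall outside the clip are clamped to the first or last frame so every frame ends up with the same shape. The file naming `{video_index}_{frame_index}.npy` in `save_per_frame` is likewise a guess.

```python
import os

import numpy as np


def build_sequences(features, seq_len=8):
    """Turn (N, 16, 29) features into (N, seq_len, 16, 29) per-frame sequences."""
    num_frames = features.shape[0]
    half = seq_len // 2
    # Offsets -half .. seq_len - half - 1 relative to each frame.
    offsets = np.arange(-half, seq_len - half)
    index = np.arange(num_frames)[:, None] + offsets[None, :]
    # Assumed edge handling: repeat the first/last frame instead of truncating.
    index = np.clip(index, 0, num_frames - 1)
    return features[index]


def save_per_frame(sequences, out_dir, video_index):
    """Write one (seq_len, 16, 29) array per frame (hypothetical naming scheme)."""
    os.makedirs(out_dir, exist_ok=True)
    for frame_index, sequence in enumerate(sequences):
        path = os.path.join(out_dir, f'{video_index}_{frame_index}.npy')
        np.save(path, sequence)
```

For a (3000, 16, 29) input, `build_sequences` returns an array of shape (3000, 8, 16, 29), and `save_per_frame` writes 3000 files of shape (8, 16, 29).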

Aug 22 '23 11:08 quqixun

Did you run a full test? In my experiments, this method does not seem to work well for identities that were not seen in the training set.

Aug 31 '23 03:08 979277

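Do the extracted video frames and audio frames correspond one to one? I converted the video to 25 fps and took the first 1000 frames, so the audio should be 40 s long, but at a 16 kHz sampling rate it has 2400 frames in total. How should I handle this?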

Sep 10 '23 07:09 zyhsuperman

Hi. I have a basic question and I hope you can help me with it. How can we specify the number of epochs in this code? The model only trains for 1 epoch on my machine.

Oct 13 '23 08:10 Tinaa23

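The audio processing follows AD-NeRF and uses DeepSpeech as the audio feature extractor.

I did not see memory usage keep growing in my experiments. If you can track down the cause, you are welcome to point it out and fix it, and the change can be merged into this project.

The results in difftalk_demo.zip look decent. In practice we also add a post-processing step: we use [Real-time intermediate flow estimation for video frame interpolation] for frame interpolation to obtain smoother videos.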

Dec 11 '23 07:12 sstzal

Hi, could you tell me whether your downloaded HDTF videos have an audio stream? Could you share the download link? Many thanks.

Jan 31 '24 20:01 kaiw7

After running deepspeech_features I am getting [x, 16, 29], where x is the number of frames.

Feb 05 '24 05:02 Utkarsh-shift

Thanks, I found your answer in the comment above.

Feb 05 '24 06:02 Utkarsh-shift

Hi, could you tell me how to download the dataset? I ran into some issues when downloading it. Thank you very much.

Feb 15 '24 02:02 kaiw7

I followed AD-NeRF's processing. Did you run into an error like RuntimeError: stack expects each tensor to be equal size, but got [4, 16, 29] at entry 0 and [8, 16, 29] at entry 1?
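An error like this means that at least two samples in one batch have different shapes. One possible cause, which is an assumption and not confirmed in this thread, is that some per-frame feature files were saved with fewer than 8 windows, for example near the start or end of a clip. A minimal sketch for locating such files, where `find_shape_mismatches` is a hypothetical helper and the directory is whatever folder holds the per-frame .npy features:

```python
import os

import numpy as np


def find_shape_mismatches(feature_dir, expected_shape=(8, 16, 29)):
    """Return (file name, shape) for every .npy file that deviates from expected_shape."""
    mismatches = []
    for name in sorted(os.listdir(feature_dir)):
        if not name.endswith('.npy'):
            continue
        shape = np.load(os.path.join(feature_dir, name)).shape
        if shape != expected_shape:
            mismatches.append((name, shape))
    return mismatches
```

If the returned list is not empty, those files would need to be regenerated or padded to the expected shape before batches can be stacked.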

Mar 25 '24 04:03 jinlingxueluo

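I have a question about the directory layout given in the instructions:

|——data/HDTF
  |——images
    |——0_0.jpg
    |——0_1.jpg
    |——...
    |——N_M.bin
  |——landmarks
    |——0_0.lmd
    |——0_1.lmd
    |——...
    |——N_M.lms
  |——audio_smooth
    |——0_0.npy
    |——0_1.npy
    |——...
    |——N_M.npy

Do 0_0.jpg and 0_1.jpg denote the first and second frames of one video, or the first frame of each segment after a video has been split into segments? What information do N_M.bin and N_M.lms store? And how should the audio file 0_0.npy correspond to 0_0.jpg: does it hold the audio features for a single frame, or for a whole segment? I would be very grateful if someone could clear this up.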

Aug 27 '24 08:08 SCP2922