How to generate files for custom audio?
According to the instructions provided here:
Note 2: To test with custom audio, you need to replace video_name/video_name.wav and the deepspeech feature video_name/deepfeature32/video_name.npy. The output length will depend on the shorter of the audio and the driven poses. Refer to here for more details.
I have copied a custom audio file at a 16 kHz sampling rate, so the directory looks like the following:

video_processed/00014
├── 00014.wav
├── deepfeature32
├── latent_evp_25
└── poseimg
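For reference, this is a sketch of how I produced the 16 kHz mono WAV (librosa and soundfile are assumed to be installed; the input filename is a placeholder):

```python
# Sketch only: resample an arbitrary audio file to 16 kHz mono 16-bit PCM.
import librosa
import soundfile as sf

audio, sr = librosa.load('my_audio.wav', sr=16000, mono=True)  # resample + downmix
sf.write('video_processed/00014/00014.wav', audio, sr, subtype='PCM_16')
```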
From the above, how do I get here?
video_processed/00014
├── 00014.wav
├── deepfeature32
│   └── 00014.npy
├── latent_evp_25
│   └── 00014.npy
└── poseimg
    └── 00014.npy.gz
Hi, if you have only one audio, you need to follow these steps:
- Extract deepfeature32 with the code here.
- Organize the files, then replace the pose-related files latent_evp_25/00014.npy and poseimg/00014.npy.gz with the ones from our preprocessed Obama video (5 min) in ./demo/video_processed/obama/ (see the sketch after this list).
- Test as the readme shows.
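A minimal sketch of the copy in step 2, assuming the Obama files follow the same <video_name>.npy naming convention as the rest of the demo data:

```python
# Sketch only: reuse the Obama driving poses for the new audio.
# Paths follow the demo layout in this thread; the obama.* file
# names are an assumption based on the <video_name>.npy convention.
import shutil

src = './demo/video_processed/obama'
dst = './demo/video_processed/00014'
shutil.copy(f'{src}/latent_evp_25/obama.npy', f'{dst}/latent_evp_25/00014.npy')
shutil.copy(f'{src}/poseimg/obama.npy.gz', f'{dst}/poseimg/00014.npy.gz')
```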
If you have a video instead, you need to preprocess your custom video first according to our preprocessing code, and then test it the same way as the other preprocessed videos.
If you have any questions, feel free to contact us.
Thank you for getting back to me so quickly, and thanks for the incredible effort.
I followed the instructions you gave me. After running the deepspeech feature extraction code, this is what my directory structure looks like:
tree -f ./demo/video_processed/00014/
./demo/video_processed/00014
├── ./demo/video_processed/00014/00014.wav
├── ./demo/video_processed/00014/deepfeature32
│ └── ./demo/video_processed/00014/deepfeature32/00014.npy
├── ./demo/video_processed/00014/latent_evp_25
│ └── ./demo/video_processed/00014/latent_evp_25/00014.npy
└── ./demo/video_processed/00014/poseimg
└── ./demo/video_processed/00014/poseimg/00014.npy.gz
However, the code still fails here:
EAT_code/demo.py", line 171, in prepare_test_data
audio_frames = torch.stack(audio_frames, dim=0)
RuntimeError: stack expects a non-empty TensorList
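For what it's worth, torch.stack raises exactly this error when handed an empty list, independent of the repo:

```python
import torch

torch.stack([], dim=0)
# RuntimeError: stack expects a non-empty TensorList
```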
Looks like the audio_frames list of tensors is empty. This is the output of ffprobe on my custom audio:
ffprobe -hide_banner ./demo/video_processed/00014/00014.wav
Input #0, wav, from './demo/video_processed/00014/00014.wav':
Metadata:
encoder : Lavf58.76.100
Duration: 00:00:06.32, bitrate: 256 kb/s
Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 16000 Hz, 1 channels, s16, 256 kb/s
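A stdlib cross-check of the same header information from Python:

```python
import wave

with wave.open('./demo/video_processed/00014/00014.wav', 'rb') as w:
    # expect 16000 Hz, 1 channel, 2-byte (16-bit) samples, matching ffprobe
    print(w.getframerate(), w.getnchannels(), w.getsampwidth())
```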
And this is the structure of the extracted deepspeech feature .npy file:
>>> d=np.load('./demo/video_processed/00014/deepfeature32/00014.npy')
>>> print(d.shape, d.min(), d.max())
(158, 16, 29) -45.72098159790039 22.231658935546875
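The feature length also looks consistent with the clip duration, assuming the pipeline runs at 25 fps (my inference from the latent_evp_25 naming):

```python
import numpy as np

feat = np.load('./demo/video_processed/00014/deepfeature32/00014.npy')
# 6.32 s of audio at 25 fps should give ~158 frames, matching feat.shape[0]
print(feat.shape[0], round(6.32 * 25))  # 158 158
```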
Any thoughts on what I might be doing wrong?
Hi, have you checked the value of num_frames here?
I forgot about the processed ground-truth (gt) frames. You may need to copy the cropped images from the preprocessed Obama files as well.
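If that is the issue, a sketch of the copy might look like this (the imgs/ folder name for the cropped ground-truth frames is an assumption, not confirmed in this thread; use whatever subfolder the preprocessed Obama directory actually contains):

```python
# Sketch only: reuse the Obama cropped ground-truth frames for the new
# audio. The imgs/ subfolder name is an assumption, not confirmed here.
import shutil

shutil.copytree('./demo/video_processed/obama/imgs',
                './demo/video_processed/00014/imgs')
```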