How to generate files for custom audio?
According to the instructions provided here:
Note 2: To test with custom audio, you need to replace video_name/video_name.wav and the deepspeech feature video_name/deepfeature32/video_name.npy. The output length will depend on the shorter of the audio and the driven poses. Refer to here for more details.
I have copied a custom audio file at a 16 kHz sampling rate, so the directory looks like the following:

video_processed/00014
├── 00014.wav
├── deepfeature32
├── latent_evp_25
└── poseimg
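For reference, this is a sketch of how I produced the 16 kHz mono WAV (librosa and soundfile are assumed to be installed; the input filename is a placeholder):

```python
# Sketch only: resample an arbitrary audio file to 16 kHz mono 16-bit PCM.
import librosa
import soundfile as sf

audio, sr = librosa.load('my_audio.wav', sr=16000, mono=True)  # resample + downmix
sf.write('video_processed/00014/00014.wav', audio, sr, subtype='PCM_16')
```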
From the above, how do I get here?
video_processed/00014
├── 00014.wav
├── deepfeature32
│   └── 00014.npy
├── latent_evp_25
│   └── 00014.npy
└── poseimg
    └── 00014.npy.gz
Hi, if you have only one audio, you need to follow these steps:
- Extract deepfeature32 with the code here.
- Organize the files, then replace the pose-related files latent_evp_25/00014.npy and poseimg/00014.npy.gz with the ones from our preprocessed Obama video (5 min) in ./demo/video_processed/obama/ (see the sketch after this list).
- Test as the readme shows.
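A minimal sketch of the copy in step 2, assuming the Obama files follow the same <video_name>.npy naming convention as the rest of the demo data:

```python
# Sketch only: reuse the Obama driving poses for the new audio.
# Paths follow the demo layout in this thread; the obama.* file
# names are an assumption based on the <video_name>.npy convention.
import shutil

src = './demo/video_processed/obama'
dst = './demo/video_processed/00014'
shutil.copy(f'{src}/latent_evp_25/obama.npy', f'{dst}/latent_evp_25/00014.npy')
shutil.copy(f'{src}/poseimg/obama.npy.gz', f'{dst}/poseimg/00014.npy.gz')
```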
If you have a video instead, you need to preprocess your custom video first according to our preprocessing code, and then test it the same way as the other preprocessed videos.
If you have any questions, feel free to contact us.
Thank you for getting back to me so quickly, and thanks for the incredible effort.
I followed the instructions you gave me. After running the deepspeech feature extraction code, this is what my directory structure looks like:
tree -f ./demo/video_processed/00014/
./demo/video_processed/00014
├── ./demo/video_processed/00014/00014.wav
├── ./demo/video_processed/00014/deepfeature32
│ └── ./demo/video_processed/00014/deepfeature32/00014.npy
├── ./demo/video_processed/00014/latent_evp_25
│ └── ./demo/video_processed/00014/latent_evp_25/00014.npy
└── ./demo/video_processed/00014/poseimg
└── ./demo/video_processed/00014/poseimg/00014.npy.gz
However, the code still fails here:
EAT_code/demo.py", line 171, in prepare_test_data
audio_frames = torch.stack(audio_frames, dim=0)
RuntimeError: stack expects a non-empty TensorList
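For what it's worth, torch.stack raises exactly this error when handed an empty list, independent of the repo:

```python
import torch

torch.stack([], dim=0)
# RuntimeError: stack expects a non-empty TensorList
```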
Looks like the audio_frames list of tensors is empty. This is the output of ffprobe on my custom audio:
ffprobe -hide_banner ./demo/video_processed/00014/00014.wav
Input #0, wav, from './demo/video_processed/00014/00014.wav':
Metadata:
encoder : Lavf58.76.100
Duration: 00:00:06.32, bitrate: 256 kb/s
Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 16000 Hz, 1 channels, s16, 256 kb/s
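A stdlib cross-check of the same header information from Python:

```python
import wave

with wave.open('./demo/video_processed/00014/00014.wav', 'rb') as w:
    # expect 16000 Hz, 1 channel, 2-byte (16-bit) samples, matching ffprobe
    print(w.getframerate(), w.getnchannels(), w.getsampwidth())
```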
And this is the structure of the extracted deepspeech feature .npy file:
>>> d=np.load('./demo/video_processed/00014/deepfeature32/00014.npy')
>>> print(d.shape, d.min(), d.max())
(158, 16, 29) -45.72098159790039 22.231658935546875
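The feature length also looks consistent with the clip duration, assuming the pipeline runs at 25 fps (my inference from the latent_evp_25 naming):

```python
import numpy as np

feat = np.load('./demo/video_processed/00014/deepfeature32/00014.npy')
# 6.32 s of audio at 25 fps should give ~158 frames, matching feat.shape[0]
print(feat.shape[0], round(6.32 * 25))  # 158 158
```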
Any thoughts on what I might be doing wrong?
Hi, have you checked the value of num_frames here?
I forgot about the processed ground-truth (gt) frames. You may need to copy the cropped images from the preprocessed Obama files as well.
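If that is the issue, a sketch of the copy might look like this (the imgs/ folder name for the cropped ground-truth frames is an assumption, not confirmed in this thread; use whatever subfolder the preprocessed Obama directory actually contains):

```python
# Sketch only: reuse the Obama cropped ground-truth frames for the new
# audio. The imgs/ subfolder name is an assumption, not confirmed here.
import shutil

shutil.copytree('./demo/video_processed/obama/imgs',
                './demo/video_processed/00014/imgs')
```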