[WIP] Add an Avatar Chatbot (Audio) example
Description
Initiate the Avatar Chatbot (Audio) example.
Issues
Type of change
List the type of change like below. Please delete options that are not relevant.
- [ ] Bug fix (non-breaking change which fixes an issue)
- [x] New feature (non-breaking change which adds new functionality)
- [ ] Breaking change (fix or feature that would break existing design and interface)
- [ ] Others (enhancement, documentation, validation, etc.)
Dependencies
Appends the "animation" microservice to the AudioQnA example to create a new avatar chatbot example: opea-project/GenAIComps#400
Tests
```bash
curl http://${host_ip}:3009/v1/avatarchatbot \
  -X POST \
  -d @sample_whoareyou.json \
  -H 'Content-Type: application/json'
```
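The same smoke test can also be sent from Python. Below is a minimal sketch, assuming the megaservice is reachable at `${host_ip}:3009` and that `sample_whoareyou.json` sits in the working directory (both taken from the curl command above); the `requests` package is assumed to be installed.

```python
# Minimal sketch of the smoke test above, written in Python instead of curl.
import json

import requests

host_ip = "localhost"  # replace with the host running the megaservice

# Reuse the same request payload as the curl example.
with open("sample_whoareyou.json") as f:
    payload = json.load(f)

resp = requests.post(
    f"http://{host_ip}:3009/v1/avatarchatbot",
    json=payload,   # sets Content-Type: application/json
    timeout=600,    # animation inference can take tens of seconds
)
resp.raise_for_status()
print(resp.text)    # expected to contain "/outputs/result.mp4"
```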
If the megaservice is running properly, you should see the following output:
"/outputs/result.mp4"
Hi @ctao456, thanks for the contribution. The talking avatar with Wav2Lip-GFPGAN looks good. Before reviewing, I have a few questions. I notice that you use HPU to run Wav2Lip and report the latency as "10-50 seconds for AvatarAnimation on Gaudi". How long is the driven audio? Is that the latency of the first run or after a warm-up? Have you tried to optimize the related models (the Wav2Lip model, GFPGAN) on HPU? With optimization it could be faster by fully utilizing the static-shape feature on Gaudi.
Hi @Spycsh, thank you for your comments.
- In the demo video, the driven audio was 22 seconds long, and the inference time was around 50 seconds using both the Wav2Lip-GAN and GFPGAN models (`--inference_mode` set to `wav2lip+gfpgan`). There is a significant speedup from switching the `--inference_mode` flag to `wav2lip_only`, with some tradeoff in face restoration quality.
- "10-50 seconds for AvatarAnimation on Gaudi" is the latency of the first run, without warm-up. We can try including a warm-up to speed it up (see the sketch after this list).
- We're using eager mode on Gaudi 2. We are not applying `torch.compile` at the moment because `torch.compile` didn't work for the GFPGAN model. We also met some `HPU-PT Bridge` issues with lazy mode. We haven't tried `torch.jit.trace()` yet.
- Thank you for your suggestion. The current efforts focus on building the micro- and megaservice architecture. We will gradually add more features for: a. HPU optimization (eager mode with `torch.compile` vs. lazy mode, `torch.jit`, HPU graphs, BF16 & INT8 precision, etc.) to accelerate graph inference; b. distributed inference on multiple Gaudi cards using DeepSpeed; c. support for more SoTA face animation models (SadTalker, LivePortrait, etc.).
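As a reference for the warm-up point above, a minimal sketch on Gaudi might look like the code below. It is only an illustration of the idea discussed in this thread, not code from this PR: the stand-in network, the input shape, and the choice of `torch.compile` with the `hpu_backend` backend are assumptions.

```python
# Illustrative only: warm up a model on HPU with a fixed (static) input shape
# so that later, timed runs do not pay graph-compilation cost.
import torch
import habana_frameworks.torch.core as htcore  # Habana PyTorch bridge

device = torch.device("hpu")

# Placeholder network standing in for Wav2Lip/GFPGAN; load the real model in practice.
model = torch.nn.Sequential(
    torch.nn.Conv2d(6, 16, 3, padding=1),
    torch.nn.ReLU(),
    torch.nn.Conv2d(16, 3, 3, padding=1),
).to(device).eval()

# Eager mode + torch.compile with the Gaudi backend; fall back to plain eager
# if compilation fails for a given model (as reported above for GFPGAN).
model = torch.compile(model, backend="hpu_backend")

# Warm-up: a few forward passes with the same shape used at inference time.
dummy = torch.randn(1, 6, 96, 96, device=device)  # assumed face-crop batch shape
with torch.no_grad():
    for _ in range(3):
        _ = model(dummy)
        htcore.mark_step()  # flush queued HPU ops (mainly relevant in lazy mode)
torch.hpu.synchronize()  # wait for the device before timing real runs
```

Subsequent runs with the same input shape then reuse the compiled graph, which is the point of the static-shape suggestion above.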