
[WIP] Add an Avatar Chatbot (Audio) example

Open • ctao456 opened this issue 1 year ago • 2 comments

Description

Initiate Avatar Chatbot (Audio) example

Issues

opea-project/docs#59

Type of change

List the type of change like below. Please delete options that are not relevant.

  • [ ] Bug fix (non-breaking change which fixes an issue)
  • [x] New feature (non-breaking change which adds new functionality)
  • [ ] Breaking change (fix or feature that would break existing design and interface)
  • [ ] Others (enhancement, documentation, validation, etc.)

Dependencies

Wav2Lip-GFPGAN

Appends the "animation" microservice to the AudioQnA example to build a new avatar chatbot example (opea-project/GenAIComps#400).
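
For readers unfamiliar with the megaservice wiring, below is a rough sketch of how the AudioQnA pipeline could be extended with the new animation stage using the GenAIComps orchestration API. The ports, endpoints, and the ServiceType.ANIMATION value are illustrative assumptions, not necessarily the exact code in this PR.

```python
# Illustrative sketch only: ports, endpoints, and ServiceType.ANIMATION are assumptions,
# not necessarily the exact wiring in this PR.
from comps import MicroService, ServiceOrchestrator, ServiceType

megaservice = ServiceOrchestrator()

# Existing AudioQnA stages: speech-to-text, LLM, text-to-speech
asr = MicroService(name="asr", host="0.0.0.0", port=9099, endpoint="/v1/audio/transcriptions",
                   use_remote_service=True, service_type=ServiceType.ASR)
llm = MicroService(name="llm", host="0.0.0.0", port=9000, endpoint="/v1/chat/completions",
                   use_remote_service=True, service_type=ServiceType.LLM)
tts = MicroService(name="tts", host="0.0.0.0", port=9088, endpoint="/v1/audio/speech",
                   use_remote_service=True, service_type=ServiceType.TTS)

# New stage: animate an avatar face from the synthesized speech
animation = MicroService(name="animation", host="0.0.0.0", port=9066, endpoint="/v1/animation",
                         use_remote_service=True, service_type=ServiceType.ANIMATION)

megaservice.add(asr).add(llm).add(tts).add(animation)
megaservice.flow_to(asr, llm)
megaservice.flow_to(llm, tts)
megaservice.flow_to(tts, animation)
```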

Tests

curl http://${host_ip}:3009/v1/avatarchatbot \
  -X POST \
  -d @sample_whoareyou.json \
  -H 'Content-Type: application/json'

If the megaservice is running properly, you should see the following output:

"/outputs/result.mp4"

ctao456 · Aug 04 '24 03:08

Hi @ctao456, thanks for the contribution. The talking avatar with Wav2Lip-GFPGAN looks good. Before reviewing, I have a few questions. I notice that you use HPU to run Wav2Lip and report a latency of "10-50 seconds for AvatarAnimation on Gaudi". How long is the driven audio? Is that the latency of the first run, or after a warm-up? Have you tried optimizing the related models (the Wav2Lip model, GFPGAN) on HPU? With optimization they could be faster by fully utilizing the static-shape feature on Gaudi.

Spycsh · Aug 06 '24 05:08

Hi @Spycsh, thank you for your comments.

  1. In the demo video, the driven audio was 22 seconds long, and the inference time was around 50 seconds when using both the Wav2Lip-GAN and GFPGAN models (--inference_mode set to wav2lip+gfpgan). Switching the --inference_mode flag to wav2lip_only gives a significant speedup, with some trade-off in face restoration quality.
  2. "10-50 seconds for AvatarAnimation on Gaudi" is the latency of the first run, without warm-up. We can try adding a warm-up run to reduce the latency of subsequent requests.
  3. Thank you for your suggestion. The current efforts focus on building the micro- and megaservice architecture. We will gradually add more features for:
     a. HPU optimization (eager mode with torch.compile vs. lazy mode, torch.jit, HPU graphs, BF16 & INT8 precision, etc.) to accelerate graph inference; see the sketch after this list.
     b. Distributed inference across multiple Gaudi cards, using DeepSpeed.
     c. Support for more SoTA face animation models (SadTalker, LivePortrait, etc.).
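
To illustrate the kind of HPU-side optimization mentioned in 3a, here is a minimal sketch of wrapping a model in an HPU graph and running BF16 inference with a warm-up pass on Gaudi. It assumes the habana_frameworks PyTorch bridge and uses a stand-in model; it is not the actual Wav2Lip/GFPGAN integration from this PR.

```python
# Minimal HPU-graph + BF16 inference sketch for Gaudi.
# Assumes the habana_frameworks PyTorch bridge is installed; the Linear layer
# is a stand-in for the animation model, not the real Wav2Lip/GFPGAN code.
import torch
import habana_frameworks.torch.core as htcore
from habana_frameworks.torch.hpu import wrap_in_hpu_graph

device = torch.device("hpu")
model = torch.nn.Linear(512, 512).eval().to(device)  # stand-in model
model = wrap_in_hpu_graph(model)  # record once, replay the HPU graph on later calls

x = torch.randn(1, 512, device=device)

with torch.no_grad(), torch.autocast(device_type="hpu", dtype=torch.bfloat16):
    model(x)            # warm-up: triggers compilation for this static shape
    htcore.mark_step()  # flush pending lazy-mode ops
    out = model(x)      # same static shape -> reuses the compiled graph
    htcore.mark_step()

print(out.shape)
```

Keeping input shapes static (fixed mel-chunk and frame batch sizes) is what allows the recorded graph to be reused across requests, which is where most of the speedup on Gaudi comes from.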

ctao456 · Aug 06 '24 15:08