
[Epic]: Intelligent Agent

Open anselm opened this issue 2 years ago · 0 comments

Support for digital interactive chatbots

We want interactive multiplayer digital chatbots for Ethereal. You should be able to talk to them with voice or text, interact with objects, and do simple scripted activities. We want them to interact with multiple participants, and we may want multiple chatbots in one room. A chatbot is a 3D avatar that you can interact with (using your avatar, your voice, or text) and that responds in a context-appropriate way using spatialized voice, text, an animated body and hand gestures, phonemes mapped to visemes, and additional facial emotional expressions, and that has a sense of presence as communicated by gaze, eye blinking, and breathing.

To deliver this we need to handle inputs from the user (voice, text, gaze, position, intent, time, or other event triggers). We need an LLM that is prompted with a specific persona (personality, background, memory, goals, emotions). We need to get back a text response, pipe it through a TTS service, simultaneously convert the text to phonemes and then visemes, and animate custom chatbots in a pleasant way.
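A minimal sketch of that request flow, with hypothetical service interfaces (`LLMService`, `TTSService`, `Phonemizer` are assumptions, not the engine's actual API): the server prompts an LLM with the agent's persona, synthesizes speech, derives a timed viseme track, and returns all three artifacts for client-side playback and puppetry.

```typescript
interface AgentPersona {
  name: string
  background: string
  goals: string[]
  memory: string[]
}

interface AgentResponse {
  text: string                                  // what the agent says
  audio: ArrayBuffer                            // TTS output, played back spatially on the client
  visemes: { time: number; viseme: string }[]   // drives mouth shapes on the rig
}

// Hypothetical service interfaces; any LLM / TTS / phonemizer backend could satisfy them.
interface LLMService { complete(prompt: string): Promise<string> }
interface TTSService { synthesize(text: string): Promise<ArrayBuffer> }
interface Phonemizer { toVisemes(text: string, audio: ArrayBuffer): Promise<AgentResponse['visemes']> }

async function handleUserUtterance(
  persona: AgentPersona,
  userText: string,
  llm: LLMService,
  tts: TTSService,
  phonemizer: Phonemizer
): Promise<AgentResponse> {
  // 1. Prompt the LLM with the persona plus the user's input.
  const prompt = `You are ${persona.name}. ${persona.background}\nUser: ${userText}\n${persona.name}:`
  const text = await llm.complete(prompt)

  // 2. Synthesize speech, then derive a timed viseme track from the text and audio.
  const audio = await tts.synthesize(text)
  const visemes = await phonemizer.toVisemes(text, audio)

  return { text, audio, visemes }
}
```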

FRAMING

  • [X] Run on the server, not the client, so that it is truly multiplayer; decouple the server side from any client-side rendering
  • [X] Isolate the client-side rendered puppet from any third-party LLM API (Inworld or whatever); decouple dependencies
  • [X] Define an IntelligentAgentComponent that can be attached to an entity with a rig and that can drive that rig (a rough sketch follows this list). The developer must also provide a ModelComponent that references a glTF representing the visual appearance of the model. At the moment we only support Ready Player Me rigs!
  • [X] Text input - a chat box
  • [x] Spatial distance filtering on the server side for text traffic (this is broken for some reason)
  • [x] Server-side speech-to-text, maybe with push-to-talk
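A rough sketch of what the IntelligentAgentComponent could look like (illustrative only; the real version would be declared with the engine's ECS component helpers and sit alongside the entity's ModelComponent pointing at an RPM glTF, and `setComponent` here is a hypothetical stand-in):

```typescript
// Persona and runtime state the server-side reasoning layer and client-side puppet share.
interface IntelligentAgentComponentType {
  personaId: string        // which persona / backend config drives this agent
  displayName: string
  voiceId: string          // TTS voice to use
  hearingRadius: number    // spatial filter: only relay text/voice from users within range
  speaking: boolean        // runtime: whether a TTS clip is currently playing
  currentVisemes: { time: number; viseme: string }[]  // runtime: active viseme track
}

// Hypothetical attach helper standing in for the engine's component API.
declare function setComponent<T>(entity: number, componentName: string, data: T): void

function makeAgent(entity: number) {
  setComponent<IntelligentAgentComponentType>(entity, 'IntelligentAgentComponent', {
    personaId: 'dr-ubiquity',
    displayName: 'Dr. Ubiquity',
    voiceId: 'default-female',
    hearingRadius: 10,
    speaking: false,
    currentVisemes: []
  })
}
```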

REASONING / COGNITION (Server Side)

  • [X] Inworld.ai bindings as a first step. Inworld is a nice chatbot engine that offers a reasonably turnkey service handling several of our concerns, such as persona, TTS, and LLM support. The code is quite messy and needs to be cleaned up.

  • [X] We need some kind of token server because Inworld requires one - right now I am running that locally; instead we have to run it in the cloud.

  • [X] Test/support Hugging Face models for the LLM and for TTS (in addition to Inworld)

  • [x] For Hugging Face TTS I need to estimate the audio duration (see the sketch after this list)

  • [x] For Hugging Face TTS I need to provision dedicated servers? May need a credit card

  • [x] For Hugging Face I need to use a custom phoneme generator and a phoneme-to-viseme mapper

  • [x] For Hugging Face I need more voices, such as a female voice

  • [x] For Hugging Face I may as well switch to using ChatGPT or something like that for reasoning?
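A sketch of the Hugging Face TTS post-processing mentioned above - estimating clip duration from the WAV header and spreading phonemes over it as timed viseme cues. The helper names, the 16-bit PCM WAV assumption, and the tiny phoneme table are all illustrative, not an existing pipeline.

```typescript
// Estimate duration of a canonical 16-bit PCM WAV buffer from its header fields.
function estimateWavDuration(wav: Buffer): number {
  const byteRate = wav.readUInt32LE(28)   // bytes 28-31: bytes per second of audio
  const dataSize = wav.length - 44        // naive: assume a 44-byte canonical header
  return dataSize / byteRate              // seconds of audio
}

// Very small phoneme -> viseme lookup (illustrative subset, not a full ARPAbet table).
const PHONEME_TO_VISEME: Record<string, string> = {
  AA: 'aa', AE: 'aa', AH: 'aa',
  B: 'PP', P: 'PP', M: 'PP',
  F: 'FF', V: 'FF',
  IY: 'I', IH: 'I',
  OW: 'O', UW: 'U',
  S: 'SS', Z: 'SS',
  sil: 'sil'
}

// Spread phonemes evenly over the clip and emit timed viseme cues for the puppet.
function phonemesToVisemeTrack(phonemes: string[], durationSec: number) {
  const step = durationSec / Math.max(phonemes.length, 1)
  return phonemes.map((p, i) => ({
    time: i * step,
    viseme: PHONEME_TO_VISEME[p] ?? 'sil'
  }))
}
```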

SCENARIOS

  • [x] Dr. Ubiquity, a physiotherapist - using the Inworld LLM
  • [x] An astronaut
  • [x] A noir detective
  • [x] Test multiple chatbots at once; test chatbots interacting with multiple users.

PUPPETRY ANIMATION (Client Side)

  • [x] Puppetry version 1 - totally isolate the client-side puppet from any Inworld code - using RPM models
  • [x] Puppetry version 2 - try supporting 'talking-head', which has a richer body model with IK - using RPM models
  • [ ] Puppetry version 3 - try a standalone model merging both of the above - using RPM models
  • [ ] Puppetry version 4 - switch to EE VRM models
  • [ ] Puppetry version 5 - have a pipeline for making models from scratch; totally control model creation
  • [ ] Refine our animation model for gesturing, breathing, gazing, and attentiveness. We do have this, but some of the scheduling and timing of these animations could use work (a scheduling sketch follows this list).
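One way the blink / breath / gesture scheduling could be tightened up - names and intervals here are illustrative, not the current implementation. Each channel keeps its own randomized next-fire time so the idle motions drift independently instead of syncing up and looking metronomic.

```typescript
interface IdleChannel {
  nextFireTime: number  // seconds of elapsed time at which this channel fires next
  minInterval: number   // seconds
  maxInterval: number   // seconds
  fire: () => void      // e.g. play a blink morph target or a breath additive clip
}

function scheduleIdleChannels(channels: IdleChannel[], elapsedSeconds: number) {
  for (const channel of channels) {
    if (elapsedSeconds >= channel.nextFireTime) {
      channel.fire()
      // Re-arm with a randomized interval so the motion never looks mechanical.
      const interval =
        channel.minInterval + Math.random() * (channel.maxInterval - channel.minInterval)
      channel.nextFireTime = elapsedSeconds + interval
    }
  }
}

// Example: blink every 2-6 s, breathe every 3-5 s, idle gesture every 8-20 s.
const channels: IdleChannel[] = [
  { nextFireTime: 0, minInterval: 2, maxInterval: 6, fire: () => { /* trigger blink */ } },
  { nextFireTime: 0, minInterval: 3, maxInterval: 5, fire: () => { /* trigger breath */ } },
  { nextFireTime: 0, minInterval: 8, maxInterval: 20, fire: () => { /* pick a gesture */ } }
]
```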

RICHER INTERACTIONS

  • [ ] The ability to point at things

  • [ ] The ability to walk to things (some kind of navmesh may be needed)

  • [ ] Decision trees or explicit sequencing of behavior over time - not using an LLM

  • [ ] Simple intents/triggers: state changes, players entering or leaving, objects appearing or disappearing, player focus/gaze, time changes - possibly not using an LLM

  • [ ] Deeper intents that are LLM-driven - given player gaze/proximity and time awareness. Send events to the LLM that notify it of what a nearby player is focusing on and other smaller events over time. The LLM may then be able to respond as in "Ah, I see you are looking at the 2023 series road bike - did you know it is the first to offer regenerative braking?" (a sketch follows this list)

  • [ ] [not critical] Review https://github.com/EtherealEngine/Digital-Beings
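A sketch of how small world events (gaze, proximity, time) could be folded into the LLM context for the "deeper intents" item above. The event and prompt shapes are assumptions, not an existing engine API.

```typescript
type WorldEvent =
  | { kind: 'gaze'; playerId: string; target: string }        // player looked at something
  | { kind: 'proximity'; playerId: string; distance: number } // player moved near the agent
  | { kind: 'time'; isoTime: string }                         // periodic time-awareness tick

// Keep a short rolling window of recent events per agent.
const recentEvents: WorldEvent[] = []

function recordEvent(event: WorldEvent, maxEvents = 10) {
  recentEvents.push(event)
  if (recentEvents.length > maxEvents) recentEvents.shift()
}

// Render the window into plain sentences that get prepended to the persona prompt,
// letting the LLM volunteer remarks (like the road-bike example) without a direct question.
function eventsToPromptContext(): string {
  return recentEvents
    .map((e) => {
      switch (e.kind) {
        case 'gaze':
          return `Player ${e.playerId} is looking at the ${e.target}.`
        case 'proximity':
          return `Player ${e.playerId} is ${e.distance.toFixed(1)}m away.`
        case 'time':
          return `The current time is ${e.isoTime}.`
      }
    })
    .join('\n')
}
```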

OTHER

  • [ ] 3D-spatialize the audio for Apple VSP experiences (see the Web Audio sketch below)
  • [ ] A form-fill field that lets one of our customers describe a chatbot (once Hugging Face LLMs are supported)
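A minimal Web Audio sketch of spatializing the agent's TTS playback on the client, relevant to the 3D spatialization item above. It uses the standard PannerNode API; how the agent and listener positions are obtained from the engine's transforms is assumed, not shown.

```typescript
function playSpatializedSpeech(
  ctx: AudioContext,
  audioBuffer: AudioBuffer,
  agentPosition: { x: number; y: number; z: number }
) {
  const source = ctx.createBufferSource()
  source.buffer = audioBuffer

  // HRTF panning gives a convincing positional cue for head-tracked / VR playback.
  const panner = new PannerNode(ctx, {
    panningModel: 'HRTF',
    distanceModel: 'inverse',
    refDistance: 1,
    maxDistance: 50,
    positionX: agentPosition.x,
    positionY: agentPosition.y,
    positionZ: agentPosition.z
  })

  source.connect(panner).connect(ctx.destination)
  source.start()
}
```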
