The Making of Ellie: How WWT Built its Cutting-Edge Digital Human in 5 Weeks Using NVIDIA AI Platforms
Learn how WWT used state-of-the-art AI tools and techniques, including NVIDIA's Avatar Cloud Engine (ACE), to build Ellie, a digital human capable of delivering seamless conversational experiences and authentic, lifelike responses in multiple languages.
At the start of 2024, WWT embarked on an innovative journey by launching a RAGbot, a chatbot powered by retrieval-augmented generation (RAG) built on NVIDIA technology. This initiative laid the groundwork for our next venture: developing a universal translator on the NVIDIA platform, focused on multilingual speech-to-text translation using NVIDIA Riva.
Our trajectory took an exciting turn when NVIDIA invited us to showcase a digital human at Disney's Data and Analytics Conference (DDAC), just five weeks away. This opportunity meant we needed to pivot swiftly, presenting both a challenge and a chance to highlight our AI capabilities.
To tackle this ambitious project, we assembled a team of experts and devised a comprehensive roadmap for designing, building, testing and deploying a digital human.
NVIDIA's Blueprint for Digital Human was still in development, so we adapted in near real time to integrate the latest updates and insights. We tailored our digital human, Ellie, to operate on-premises within a fully air-gapped environment using minimal infrastructure resources. With just four GPUs, users could engage with a digital human capable of conversing in five languages.
Assembling the right team and processes
We knew we'd need to leverage the diverse expertise of multiple individuals across WWT to bring our vision for Ellie to life.
We assembled a tiger team composed of diverse experts, all united by an AI-first mindset, enabling us to efficiently identify and address various pain points. By sharing our best practices, troubleshooting techniques and innovative solutions, we enhanced our collaborative efforts.
Leveraging the NVIDIA Blueprint for Digital Human, we applied customizations to develop Ellie within a local setup that required minimal infrastructure and resources. This strategic approach allowed us to successfully complete our digital human in time for Disney's conference.
So, how did we do it?
Building a digital human
Digital humans are created using the NVIDIA ACE platform, which combines Omniverse, Audio2Face, natural language processing (NLP), machine learning (ML), animation graphs, speech synthesis and emotion recognition.
Discovery
Since the avatar would be showcased at a Disney event, we asked ourselves what kind of scenario would be most compelling given the type of products and services Disney offers. Many of us have visited Disney theme parks. Being familiar with the experience, we picked the scenario of a park visitor who might need help locating rides or restaurants, perhaps wanting to know about wait times or food options. Wouldn't it be helpful if you could just walk up to a screen and talk to an assistant to find out what you needed? Disney guests come from around the world, so what if the avatar could determine what language you were speaking and respond to you in kind? We set this as our goal for what to build.
Solution design
We built Ellie leveraging NVIDIA's ACE platform — a suite of cloud-native AI models and services crafted for creating and deploying interactive digital avatars with generative AI (GenAI). At its core, ACE combines key NVIDIA technologies designed to handle speech recognition, natural language understanding and real-time rendering, all hosted on a scalable, cloud-native architecture. Ellie, however, runs on-premises with everything local, which helps us avoid latency problems at conferences and other events.
Prototyping and proof of concept
We typically develop a prototype for the client to react to so we can gather very detailed feedback for the final version. Because of the time crunch, we demoed the prototype at the conference and have continued to refine and improve Ellie since then.
Overcoming challenges through innovative thinking
From maintaining conversational context to managing memory limitations to matching appropriate facial expressions to different types of responses, each obstacle we encountered demanded a creative solution.
One of our primary concerns was ensuring Ellie could engage in seamless, natural-sounding dialogues. However, the nascent AI platform was limited in the length of conversations it could handle. So, we developed custom workarounds to keep track of the conversational context and provide smooth transitions between questions and responses.
We also faced issues with memory constraints, which threatened to cause Ellie to crash during extended interactions. We had to carefully manage the system's resources to prevent it from abruptly stopping.
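To make this kind of workaround concrete, here is a minimal sketch of the general approach: keep only a bounded, rolling window of recent turns so the prompt passed to the LLM never outgrows the platform's context limit and memory use stays flat. The class name and the turn and character limits below are illustrative, not the values used in Ellie.

```python
from collections import deque


class RollingConversationContext:
    """Keeps a bounded window of recent turns so prompts stay within the
    model's context limit and memory use stays flat during long sessions.
    The limits here are illustrative, not Ellie's actual settings."""

    def __init__(self, max_turns: int = 8, max_chars: int = 4000):
        self.turns = deque(maxlen=max_turns)  # oldest turns fall off automatically
        self.max_chars = max_chars

    def add_turn(self, user_text: str, assistant_text: str) -> None:
        self.turns.append((user_text, assistant_text))

    def build_prompt(self, new_question: str) -> str:
        # Assemble history oldest-first, then trim from the front if it is
        # still too long for the model's context window.
        history = "\n".join(f"User: {u}\nEllie: {a}" for u, a in self.turns)
        prompt = f"{history}\nUser: {new_question}\nEllie:"
        if len(prompt) > self.max_chars:
            prompt = prompt[-self.max_chars:]
        return prompt
```

Trimming from the oldest turns preserves the most recent exchanges, which are usually what a follow-up question depends on.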
Innovation at work: Ellie works offline!
To address these types of challenges, we drew upon our collective technical expertise and problem-solving skills. We experimented with various approaches, tested different configurations and ultimately developed solutions that allowed Ellie to operate seamlessly, even in the face of connectivity issues or offline scenarios.
In fact, we worked to make our digital human fully functional without an internet connection — something we've not encountered in any digital human solution on the market thus far. This allows Ellie to operate in remote locations or venues where bandwidth is limited or nonexistent, without compromising her performance or reliability.
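The full offline design comes down to hosting every model locally, but one simple ingredient worth sketching is the local caching referenced later in this article. The snippet below is a hypothetical, minimal on-disk response cache; the directory, key scheme and JSON layout are illustrative only, not Ellie's implementation.

```python
import hashlib
import json
import os

CACHE_DIR = "./ellie_cache"  # local disk only, no network dependency (path is illustrative)


def _key(question: str, language: str) -> str:
    # Normalize the query so trivially different phrasings hit the same entry.
    normalized = " ".join(question.lower().split()) + "|" + language
    return hashlib.sha256(normalized.encode()).hexdigest()


def cached_answer(question: str, language: str):
    path = os.path.join(CACHE_DIR, _key(question, language) + ".json")
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)["answer"]
    return None


def store_answer(question: str, language: str, answer: str) -> None:
    os.makedirs(CACHE_DIR, exist_ok=True)
    path = os.path.join(CACHE_DIR, _key(question, language) + ".json")
    with open(path, "w") as f:
        json.dump({"question": question, "answer": answer}, f)
```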
Rigorous testing and troubleshooting
With the technical foundations in place, we turned our attention to making sure Ellie delivers flawless performances and truly engaging user experiences. Recognizing the importance of quality and accuracy, we implemented a rigorous testing and troubleshooting regimen to refine every aspect of the digital human.
The entire development team worked collaboratively to put Ellie through her paces. The testing was all about ensuring Ellie's responses were accurate, the gestures and animations were properly synchronized, and the overall experience was polished. We knew that any issues or failures could negatively impact the perception of the technology, so we were meticulous.
Deployment and training
After successful testing, Ellie was deployed with a robust system architecture that included offline functionality.
We provided hands-on training, documented through video and written instructions, so our internal operators can confidently demonstrate Ellie's capabilities to clients and partners at events around the world.
How Ellie works
Here's a high-level breakdown of how Ellie works to deliver seamless conversational experiences, followed by a simplified sketch of the flow:
- Input processing and speech recognition: User input, either speech or text, is ingested and processed. For spoken input, NVIDIA Riva automatically converts audio to text and translates it to English if necessary.
- Natural language processing (NLP): The text is then processed through a large language model (LLM) powered by NVIDIA NeMo, which enables Ellie to understand complex queries and generate accurate, context-aware responses.
- Text-to-speech and animation: Once the response is generated, it is translated back into the user's language and converted to audio by Riva's text-to-speech (TTS) module. NVIDIA Audio2Face then animates the response, creating realistic facial movements that synchronize precisely with the speech, adding a layer of authenticity to the digital human interaction.
- Rendering and display: The animated response is rendered in real time using NVIDIA Omniverse, presenting a visually immersive, lifelike avatar that interacts naturally with users.
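Put together, the flow above can be summarized in a few lines of pseudo-integration code. Each method on the hypothetical `services` object stands in for a call to the corresponding NVIDIA service (Riva ASR, translation and TTS, the NeMo-powered LLM, Audio2Face, Omniverse); the names and signatures are illustrative, not actual APIs.

```python
# Minimal sketch of the request flow described above. Every call below is a
# placeholder for the corresponding service; names are illustrative only.

def handle_utterance(audio: bytes, services) -> None:
    # 1. Speech recognition: audio -> text, plus detected language
    text, language = services.riva_asr(audio)

    # 2. Translate to English if the user spoke another language
    query = services.riva_translate(text, source=language, target="en") if language != "en" else text

    # 3. LLM generates a context-aware answer
    answer_en = services.llm_respond(query)

    # 4. Translate back and synthesize speech in the user's language
    answer = services.riva_translate(answer_en, source="en", target=language) if language != "en" else answer_en
    speech = services.riva_tts(answer, language=language)

    # 5. Audio2Face derives facial animation from the audio; Omniverse renders it
    animation = services.audio2face(speech)
    services.omniverse_render(animation, speech)
```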
This robust solution runs on a single workstation equipped with four NVIDIA RTX 6000 Ada GPUs (each with 48 GB of VRAM). The setup efficiently divides tasks across the GPUs, with two allocated to running the Llama 3 8B model, one to Omniverse rendering, and the fourth supporting both Riva and the NVIDIA NIM embedding model. With integrated caching, this workflow operates offline, providing reliable performance without requiring internet connectivity.
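One common way to realize this kind of per-service GPU split is to pin each process to specific devices via CUDA_VISIBLE_DEVICES. The mapping below mirrors the allocation described above, but the launch commands and service names are placeholders, not Ellie's actual deployment scripts.

```python
import os
import subprocess

# Hypothetical mapping of Ellie's services to the four GPUs described above.
# Launch commands are placeholders for the real service start scripts.
SERVICE_GPUS = {
    "llm":        {"gpus": "0,1", "cmd": ["./start_llm.sh"]},        # Llama 3 8B across two GPUs
    "rendering":  {"gpus": "2",   "cmd": ["./start_omniverse.sh"]},  # Omniverse avatar rendering
    "speech_rag": {"gpus": "3",   "cmd": ["./start_riva_embed.sh"]}, # Riva + NIM embedding model
}


def launch(service: str) -> subprocess.Popen:
    spec = SERVICE_GPUS[service]
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=spec["gpus"])  # pin the process to its GPUs
    return subprocess.Popen(spec["cmd"], env=env)
```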
Our system architecture incorporates a NIM embedding model, a vector database and NVIDIA NeMo Guardrails — an open-source toolkit for easily adding programmable guardrails to LLM-based conversational systems that ensure secure and controlled interactions. By enabling retrieval-augmented generation (RAG), this setup enriches responses with relevant information and allows Ellie to answer follow-up questions, enhancing interaction depth and context continuity.
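A stripped-down view of the RAG step might look like the following. The `embed_fn` and `generate_fn` callables stand in for the NIM embedding model and the local Llama 3 8B endpoint, and the in-memory list stands in for the vector database; in Ellie, NeMo Guardrails policies additionally wrap the generation call. This is a sketch of the pattern, not the production code.

```python
from typing import Callable

import numpy as np


def retrieve(question: str,
             index: list[tuple[str, np.ndarray]],
             embed_fn: Callable[[str], np.ndarray],
             k: int = 3) -> list[str]:
    """Cosine-similarity search over a small in-memory stand-in for the vector database."""
    q = embed_fn(question)

    def score(vec: np.ndarray) -> float:
        return float(np.dot(q, vec) / (np.linalg.norm(q) * np.linalg.norm(vec)))

    ranked = sorted(index, key=lambda item: score(item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def answer_with_rag(question: str,
                    index: list[tuple[str, np.ndarray]],
                    embed_fn: Callable[[str], np.ndarray],
                    generate_fn: Callable[[str], str]) -> str:
    # Retrieved passages are prepended to the prompt so the LLM answers from
    # relevant, local knowledge rather than from its parameters alone.
    context = "\n".join(retrieve(question, index, embed_fn))
    prompt = ("Answer using only the context below. If the answer is not there, say so.\n"
              f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")
    return generate_fn(prompt)
```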
Pioneering the future of digital humans
After overcoming the challenges above within a five-week timeframe, WWT delivered a compelling demonstration of Ellie's impressive capabilities at DDAC.
Our highly technical solution demonstrates the advanced capabilities of NVIDIA's ACE suite and WWT's commitment to leveraging cutting-edge AI. With scalable, real-time interaction, multilingual support and extensive customization options, Ellie represents a new level of immersive and interactive digital human experiences that can be tailored to meet the demands of diverse industries and complex applications.
This project underscores our ongoing collaboration with NVIDIA to push the boundaries of AI and deliver unparalleled interactive experiences through digital humans.
This report may not be copied, reproduced, distributed, republished, downloaded, displayed, posted or transmitted in any form or by any means, including, but not limited to, electronic, mechanical, photocopying, recording, or otherwise, without the prior express written permission of WWT Research. It consists of the opinions of WWT Research and as such should not be construed as statements of fact. WWT provides the Report "AS-IS", although the information contained in the Report has been obtained from sources that are believed to be reliable. WWT disclaims all warranties as to the accuracy, completeness or adequacy of the information.