How to Build a Talking Avatar With AI Voice
The Three Components of a Talking Avatar
Every talking avatar system has three parts that work together: the visual character, the voice audio, and the animation data that connects them.
1. The Visual Character
This is the face or figure your users see. It can be a 3D model rendered in Unity or Unreal Engine, a 2D illustrated character with swappable mouth sprites on a web page, a stylized cartoon, or a realistic human face. The complexity of the character determines how much animation work is needed. A simple 2D character with five or six mouth positions looks good enough for most chatbot and kiosk applications. A detailed 3D face with dozens of blend shapes produces cinema-quality lip sync but requires more development effort.
2. The AI Voice
The voice is generated by a text-to-speech (TTS) API. You send text, and it returns an audio file of a natural-sounding voice reading that text. The voice you choose sets the personality of the avatar. A warm, friendly ElevenLabs voice suits a customer service avatar. A crisp, authoritative AWS Polly voice fits a news presenter character. See How to Choose the Right AI Voice for detailed guidance.
3. The Lip Sync Animation Data
This is the bridge between voice and visuals. The API returns a sequence of timed viseme events alongside the audio. Your rendering engine reads these events and adjusts the character's mouth to match the speech in real time. See How to Generate Lip Sync Animation Data for the technical details of what the data looks like and how to process it.
Building a Web-Based Talking Avatar
Design or commission a character face with a set of mouth positions. For a basic web implementation, you need images or SVG shapes for 6-10 mouth positions: closed, slightly open, wide open, rounded (O shape), wide smile (EE shape), lips together (B/P/M sounds), teeth on lip (F/V sounds), and a few transitions. Each mouth position corresponds to a group of visemes.
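To make the viseme-to-sprite mapping concrete, here is a minimal sketch in JavaScript. The viseme identifiers and file names are illustrative assumptions, not any particular provider's actual labels; substitute the identifiers your TTS provider returns.

```javascript
// Illustrative mapping from viseme identifiers to mouth sprites.
// Both the keys and the file names are hypothetical.
const visemeToMouth = {
  sil: "mouth-closed.svg",        // silence
  PP:  "mouth-lips-together.svg", // B/P/M sounds
  FF:  "mouth-teeth-on-lip.svg",  // F/V sounds
  O:   "mouth-rounded.svg",       // O shape
  E:   "mouth-wide-smile.svg",    // EE shape
  AA:  "mouth-wide-open.svg",     // open vowels
  DD:  "mouth-slightly-open.svg", // default consonants
};

// Fall back to a neutral mouth for any viseme without its own sprite.
function mouthFor(viseme) {
  return visemeToMouth[viseme] ?? "mouth-slightly-open.svg";
}
```

Grouping several visemes onto one sprite this way is what lets a character with only six to ten mouth positions cover the full set of speech sounds.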
Configure your application to send text to the AI Apps API voice endpoint with lip sync data enabled. The response contains the audio URL and a JSON array of viseme events with timestamps.
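A request sketch might look like the following. The endpoint URL, the `lipSync` parameter, and the response field names are all assumptions for illustration; substitute your provider's actual API.

```javascript
// Hypothetical request sketch -- endpoint URL, request body fields, and
// response shape are assumed, not a real API specification.
async function synthesize(text) {
  const res = await fetch("https://api.example.com/v1/voice", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ text, lipSync: true }),
  });
  if (!res.ok) throw new Error(`TTS request failed: ${res.status}`);
  return res.json(); // e.g. { audioUrl, visemes: [{ time, viseme }, ...] }
}

// Defensive cleanup: make sure viseme events are sorted by timestamp
// before handing them to the animation loop.
function sortVisemes(events) {
  return [...events].sort((a, b) => a.time - b.time);
}
```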
Write a JavaScript function that plays the audio and simultaneously steps through the viseme events. As the audio playback position reaches each viseme timestamp, update the displayed mouth image or SVG. Use requestAnimationFrame for smooth updates, and interpolate between mouth positions rather than snapping from one to the next; the gradual transitions look more natural.
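The timing logic can be sketched as a pure lookup function plus a requestAnimationFrame driver. The event shape `{ time, viseme }` (seconds, identifier) is an assumption carried over from the earlier examples.

```javascript
// Given viseme events sorted by time and the current playback time in
// seconds, return the active viseme, the next one, and a 0..1 blend
// factor for interpolating between their mouth shapes.
function visemeAt(events, t) {
  let current = 0;
  for (let i = 0; i < events.length && events[i].time <= t; i++) current = i;
  const next = Math.min(current + 1, events.length - 1);
  const span = events[next].time - events[current].time;
  const blend =
    span > 0 ? Math.min(Math.max((t - events[current].time) / span, 0), 1) : 0;
  return { current: events[current].viseme, next: events[next].viseme, blend };
}

// Browser driver sketch: sample the audio element's clock every frame.
function drive(audio, events, render) {
  function frame() {
    render(visemeAt(events, audio.currentTime));
    if (!audio.ended) requestAnimationFrame(frame);
  }
  audio.addEventListener("play", () => requestAnimationFrame(frame));
}
```

Driving the animation from `audio.currentTime` rather than a separate timer keeps the mouth in sync even if playback stalls or the tab throttles frames.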
A talking avatar that freezes between speech looks unnatural. Add subtle idle animations like blinking, small head movements, and breathing to keep the character alive even when not speaking. These run continuously and blend with the lip sync animation during speech.
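A blink, for example, can be modeled as a simple periodic function of time; the interval and duration below are arbitrary starting values, not recommendations.

```javascript
// Idle-blink sketch: returns eyelid openness from 0 (closed) to 1 (open)
// for a given time, blinking briefly every `interval` seconds.
function eyelidOpenness(t, interval = 4, blinkDuration = 0.15) {
  const phase = t % interval;           // time since the last scheduled blink
  if (phase >= blinkDuration) return 1; // eyes open most of the time
  const p = phase / blinkDuration;      // 0..1 progress through the blink
  return Math.abs(p - 0.5) * 2;         // close and reopen (triangle wave)
}
```

Jittering the interval per blink avoids a metronome effect, and the same time-based approach extends to breathing and small head movements.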
Wire the avatar to your text source. For a chatbot, the flow is: user sends message, chatbot generates text response, text goes to TTS API, audio and viseme data come back, avatar speaks the response. For scripted content like tutorials or presentations, queue the text segments and play them in sequence.
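The queuing step can be sketched as a small class; `speak` here stands in for whatever function sends one segment to the TTS API, plays the result, and resolves when the audio finishes.

```javascript
// Play queued text segments strictly one after another.
class SpeechQueue {
  constructor(speak) {
    this.speak = speak; // async (text) => resolves when audio finishes
    this.queue = [];
    this.draining = null;
  }
  enqueue(text) {
    this.queue.push(text);
    // Start draining if idle; otherwise reuse the in-flight drain.
    if (!this.draining) this.draining = this.drain();
    return this.draining;
  }
  async drain() {
    while (this.queue.length > 0) {
      await this.speak(this.queue.shift());
    }
    this.draining = null;
  }
}
```

The same queue serves both cases from the paragraph above: a chatbot enqueues each generated response, while scripted content enqueues its segments up front.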
Building a 3D Talking Avatar in Unity
For game developers and interactive 3D applications, the approach is similar but uses Unity's blend shape system instead of 2D sprite swapping. Your 3D character model needs facial blend shapes (sometimes called morph targets) for each viseme. Unity's animation system can interpolate smoothly between blend shape weights, which produces natural-looking lip movement.
The pipeline is: receive viseme data from the API, map each viseme identifier to the corresponding blend shape index on your character, and set the blend shape weight at each timestamp. Unity's AudioSource component handles audio playback while a coroutine or Update loop handles the blend shape updates in sync. The result is a character whose lips move naturally with the AI-generated speech.
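Unity scripts are written in C#, but the cross-fade math itself is language-agnostic, so here it is in JavaScript to match the earlier examples. The viseme-to-blend-shape mapping is assumed, and weights follow Unity's 0-100 convention.

```javascript
// Hypothetical mapping from viseme identifiers to blend shape indices
// on the character model.
const visemeToBlendShape = { sil: 0, PP: 1, O: 2, E: 3 };

// Cross-fade the current and next visemes: returns one weight per blend
// shape in Unity's 0..100 range. In C# you would apply these with
// SkinnedMeshRenderer.SetBlendShapeWeight each frame.
function blendShapeWeights(currentViseme, nextViseme, blend, shapeCount = 4) {
  const weights = new Array(shapeCount).fill(0);
  weights[visemeToBlendShape[currentViseme]] = (1 - blend) * 100;
  weights[visemeToBlendShape[nextViseme]] += blend * 100;
  return weights;
}
```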
Real-Time vs Pre-Rendered Avatars
Real-time avatars generate speech and animation on the fly during user interaction. This is what you need for chatbots, customer service agents, and interactive kiosks. Latency matters here, so choose a fast TTS provider and optimize your rendering pipeline. See How to Optimize Voice Output for Low Latency.
Pre-rendered avatars generate speech and animation ahead of time and serve the finished video or animation. This works for e-learning content, marketing videos, and any case where the content does not change dynamically. Pre-rendering lets you use the highest quality voices and rendering settings without worrying about latency, then distribute the results as standard video files.
Common Talking Avatar Applications
- AI customer service agent on your website that greets visitors and answers questions with a friendly face
- Virtual tutor for online courses that explains concepts and walks through lessons
- Game NPC that delivers quest dialogue with lip-synced speech in real time
- Kiosk greeter in a store, hospital, or museum that guides visitors
- Brand mascot that introduces products or reads announcements on your website
- Virtual receptionist for businesses that answers calls or chats with an animated face
Build a talking avatar with AI voice and synchronized lip animation. One API call gives you both speech and animation data.
Get Started Free