
How to Build a Talking Avatar With AI Voice

A talking avatar is a visual character that speaks with AI-generated voice and synchronized lip movement. You build one by combining a character model (2D or 3D), AI text-to-speech for audio, and lip sync viseme data to animate the mouth. The AI Apps API platform generates both the speech and animation data in a single call, so the avatar speaks naturally without manual animation work.

The Three Components of a Talking Avatar

Every talking avatar system has three parts that work together: the visual character, the voice audio, and the animation data that connects them.

1. The Visual Character

This is the face or figure your users see. It can be a 3D model rendered in Unity or Unreal Engine, a 2D illustrated character with swappable mouth sprites on a web page, a stylized cartoon, or a realistic human face. The complexity of the character determines how much animation work is needed. A simple 2D character with five or six mouth positions looks good enough for most chatbot and kiosk applications. A detailed 3D face with dozens of blend shapes produces cinema-quality lip sync but requires more development effort.

2. The AI Voice

The voice is generated by the TTS API. You send text, and it returns an audio file of a natural-sounding voice reading that text. The voice you choose sets the personality of the avatar. A warm, friendly ElevenLabs voice suits a customer service avatar. A crisp, authoritative AWS Polly voice fits a news presenter character. See How to Choose the Right AI Voice for detailed guidance.

3. The Lip Sync Animation Data

This is the bridge between voice and visuals. The API returns a sequence of timed viseme events alongside the audio. Your rendering engine reads these events and adjusts the character's mouth to match the speech in real time. See How to Generate Lip Sync Animation Data for the technical details of what the data looks like and how to process it.
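The exact payload varies by provider, but as a sketch (the field names and viseme labels here are assumptions, not the documented AI Apps API schema), viseme data is an array of timestamped events, and the renderer looks up whichever event is active at the current playback time:

```javascript
// Hypothetical viseme event payload for the word "hello".
// Field names and viseme labels are illustrative assumptions.
const visemeEvents = [
  { timeMs: 0,   viseme: "sil" }, // silence, mouth closed
  { timeMs: 120, viseme: "HH" },
  { timeMs: 180, viseme: "EH" },
  { timeMs: 300, viseme: "L" },
  { timeMs: 420, viseme: "OW" },
  { timeMs: 600, viseme: "sil" },
];

// Return the viseme active at a given playback time:
// the last event whose timestamp is <= timeMs.
function activeViseme(events, timeMs) {
  let current = events[0].viseme;
  for (const e of events) {
    if (e.timeMs <= timeMs) current = e.viseme;
    else break;
  }
  return current;
}
```

Because events are sorted by timestamp, a linear scan (or binary search for long clips) is all the lookup requires.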

Building a Web-Based Talking Avatar

Step 1: Create the character artwork.
Design or commission a character face with a set of mouth positions. For a basic web implementation, you need images or SVG shapes for 6-10 mouth positions: closed, slightly open, wide open, rounded (O shape), wide smile (EE shape), lips together (B/P/M sounds), teeth on lip (F/V sounds), and a few transitions. Each mouth position corresponds to a group of visemes.
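The grouping described above can be expressed as a lookup table that collapses many visemes onto a few sprites. A minimal sketch, where the viseme labels and sprite filenames are assumptions for illustration:

```javascript
// Mouth sprites, one per drawn mouth position (filenames are assumptions).
const MOUTH_SPRITES = {
  closed:   "mouth-closed.svg",  // sil, B/P/M sounds
  open:     "mouth-open.svg",    // AA, AH
  wide:     "mouth-wide.svg",    // EE, IY
  round:    "mouth-round.svg",   // OW, UW
  teethLip: "mouth-fv.svg",      // F, V
};

// Collapse fine-grained visemes onto the small sprite set.
const VISEME_TO_SPRITE = {
  sil: "closed", B: "closed", P: "closed", M: "closed",
  AA: "open", AH: "open",
  EE: "wide", IY: "wide",
  OW: "round", UW: "round",
  F: "teethLip", V: "teethLip",
};

function spriteForViseme(viseme) {
  const key = VISEME_TO_SPRITE[viseme] || "closed"; // default: mouth closed
  return MOUTH_SPRITES[key];
}
```

Defaulting unknown visemes to the closed mouth keeps the avatar from glitching when a provider emits a label you have not mapped yet.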
Step 2: Set up the TTS API call.
Configure your application to send text to the AI Apps API voice endpoint with lip sync data enabled. The response contains the audio URL and a JSON array of viseme events with timestamps.
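As a sketch of that call (the endpoint URL, field names, and the lip sync flag are assumptions; check your provider's API reference for the real schema):

```javascript
// Build the TTS request. All names here are placeholder assumptions,
// not the documented AI Apps API schema.
function buildTtsRequest(text, voiceId) {
  return {
    url: "https://api.example.com/v1/tts", // placeholder endpoint
    body: {
      text,
      voice: voiceId,
      lipSync: true, // ask for viseme events alongside the audio
    },
  };
}

// Usage sketch (browser or Node 18+):
// const { url, body } = buildTtsRequest("Hello there!", "warm-female-1");
// const res = await fetch(url, {
//   method: "POST",
//   headers: { "Content-Type": "application/json" },
//   body: JSON.stringify(body),
// });
// const { audioUrl, visemes } = await res.json();
```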
Step 3: Build the animation loop.
Write a JavaScript function that plays the audio and simultaneously reads through the viseme events. As the audio playback position reaches each viseme timestamp, update the displayed mouth image or SVG. Use requestAnimationFrame for smooth updates and interpolate between mouth positions rather than snapping, which makes the animation look more natural.
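The core of that loop can be kept as a pure function: given the playback time, find the current and next viseme and how far the playhead sits between them, then blend the two mouth shapes by that fraction. A sketch (drawMouth is a hypothetical rendering helper you would supply):

```javascript
// Find the current and next viseme at a playback time, plus an
// interpolation factor t in [0, 1] for blending between them.
function visemeBlend(events, timeMs) {
  let i = 0;
  while (i + 1 < events.length && events[i + 1].timeMs <= timeMs) i++;
  const current = events[i];
  const next = events[i + 1];
  if (!next) return { from: current.viseme, to: current.viseme, t: 0 };
  const span = next.timeMs - current.timeMs;
  const t = span > 0 ? (timeMs - current.timeMs) / span : 0;
  return { from: current.viseme, to: next.viseme, t };
}

// Browser wiring sketch (assumes an <audio> element and a drawMouth helper):
// function animate(audio, events) {
//   const timeMs = audio.currentTime * 1000;
//   const { from, to, t } = visemeBlend(events, timeMs);
//   drawMouth(from, to, t); // crossfade or morph between the two shapes
//   if (!audio.ended) requestAnimationFrame(() => animate(audio, events));
// }
```

Keeping the lookup pure makes the timing logic easy to unit test separately from the rendering code.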
Step 4: Add idle animations.
A talking avatar that freezes between speech looks unnatural. Add subtle idle animations like blinking, small head movements, and breathing to keep the character alive even when not speaking. These run continuously and blend with the lip sync animation during speech.
Step 5: Connect to your chatbot or content system.
Wire the avatar to your text source. For a chatbot, the flow is: user sends message, chatbot generates text response, text goes to TTS API, audio and viseme data come back, avatar speaks the response. For scripted content like tutorials or presentations, queue the text segments and play them in sequence.
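The flow above reduces to a short async pipeline. In this sketch, synthesize and playThroughAvatar are placeholders for your TTS call and rendering code; awaiting each segment in turn gives the queued, in-sequence playback described for scripted content:

```javascript
// Play text segments through the avatar one at a time.
// synthesize(text) -> { audioUrl, visemes } and
// playThroughAvatar(audioUrl, visemes) are hypothetical helpers
// you implement with your TTS call and rendering code.
async function avatarSay(segments, synthesize, playThroughAvatar) {
  for (const text of segments) {
    const { audioUrl, visemes } = await synthesize(text); // TTS + lip sync data
    await playThroughAvatar(audioUrl, visemes); // resolves when playback ends
  }
}
```

For a chatbot, segments would be the single generated reply; for a tutorial or presentation, it is the ordered list of script lines.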

Building a 3D Talking Avatar in Unity

For game developers and interactive 3D applications, the approach is similar but uses Unity's blend shape system instead of 2D sprite swapping. Your 3D character model needs facial blend shapes (sometimes called morph targets) for each viseme. Unity's animation system can interpolate between blend shape weights smoothly, which produces very natural-looking lip movement.

The pipeline is: receive viseme data from the API, map each viseme identifier to the corresponding blend shape index on your character, and set the blend shape weight at each timestamp. Unity's AudioSource component handles audio playback while a coroutine or Update loop handles the blend shape updates in sync. The result is a character whose lips move naturally with the AI-generated speech.
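The weight update itself is simple math. Sketched here in JavaScript for brevity (in Unity the same logic runs in C#, ending in a SkinnedMeshRenderer.SetBlendShapeWeight call each frame); the viseme names, blend shape indices, and smoothing factor are illustrative assumptions:

```javascript
// Map viseme identifiers to blend shape indices on the character
// (indices are assumptions; they depend on your model's rig).
const VISEME_TO_BLENDSHAPE = { sil: 0, AA: 1, OW: 2, EE: 3 };

// Move each weight a fraction of the way toward its target every frame.
// This exponential smoothing avoids snapping between mouth shapes.
// Unity blend shape weights run 0..100.
function updateWeights(weights, targetViseme, smoothing = 0.35) {
  const targetIndex = VISEME_TO_BLENDSHAPE[targetViseme] ?? 0;
  return weights.map((w, i) => {
    const target = i === targetIndex ? 100 : 0;
    return w + (target - w) * smoothing;
  });
}
```

Called once per frame from an Update loop or coroutine, this converges each shape toward its target so the mouth eases into each viseme instead of popping.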

Real-Time vs Pre-Rendered Avatars

Real-time avatars generate speech and animation on the fly during user interaction. This is what you need for chatbots, customer service agents, and interactive kiosks. Latency matters here, so choose a fast TTS provider and optimize your rendering pipeline. See How to Optimize Voice Output for Low Latency.

Pre-rendered avatars generate speech and animation ahead of time and serve the finished video or animation. This works for e-learning content, marketing videos, and any case where the content does not change dynamically. Pre-rendering lets you use the highest quality voices and rendering settings without worrying about latency, then distribute the results as standard video files.


Quality tip: The most important factor in a convincing talking avatar is not rendering quality but timing accuracy. Even a simple 2D character with six mouth shapes looks great if the lip sync timing is tight. Conversely, a beautifully rendered 3D face with poor sync timing looks worse than no animation at all. Focus on getting the timing right before investing in visual complexity.

Build a talking avatar with AI voice and synchronized lip animation. One API call gives you both speech and animation data.

Get Started Free