How to Generate Lip Sync Animation Data With TTS
What Lip Sync Data Is
Lip sync data is a sequence of timed markers that describe the shape a character's mouth should take at each moment during speech. Each marker contains a timestamp (when in the audio this mouth shape occurs), a viseme identifier (which mouth shape to display), and duration or blend weight information for smooth transitions between shapes.
Visemes are the visual equivalent of phonemes. While phonemes are the distinct sounds in speech, visemes are the distinct mouth shapes that correspond to those sounds. English speech uses roughly 15-20 viseme categories. For example, the "B" and "P" sounds share the same viseme (lips pressed together), while "O" has a distinct round mouth shape, and "EE" shows a wide mouth with visible teeth.
The platform generates this data by analyzing the phonetic content of the TTS output and mapping it to the standard viseme set. Because the lip sync data is generated from the same process as the audio, the timing alignment is precise. There is no drift or sync error that you would get from analyzing audio separately with a third-party lip sync tool.
How to Request Lip Sync Data
In your API call to the voices app, include the parameter that requests lip sync data along with the audio. The exact parameter name depends on your integration method, but the concept is the same: you tell the API you want animation data returned alongside the audio file.
The API returns both the audio file (typically in MP3 or WAV format) and a JSON array of viseme events. Each event contains a time offset in milliseconds, a viseme type identifier, and a value or weight for blending. The audio and viseme data are synchronized so you can play both together.
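As a rough sketch of consuming that response, the snippet below parses a viseme event array. The field names (`time_ms`, `viseme`, `weight`) and the sample values are assumptions for illustration; the exact keys depend on your integration, as noted above.

```python
import json

# Hypothetical viseme payload returned alongside the audio file.
# Field names are illustrative; your actual response schema may differ.
response_json = """
[
  {"time_ms": 0,   "viseme": "sil", "weight": 1.0},
  {"time_ms": 120, "viseme": "p",   "weight": 1.0},
  {"time_ms": 200, "viseme": "ae",  "weight": 0.8}
]
"""

events = json.loads(response_json)
for ev in events:
    # Each event: when it occurs, which mouth shape, and how strongly to blend it.
    print(f'{ev["time_ms"]:>5} ms  viseme={ev["viseme"]}  weight={ev["weight"]}')
```

Because the events carry millisecond offsets relative to the start of the audio, you can schedule them against the audio player's playback clock with no extra alignment step.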
Your rendering engine reads the viseme events and applies them to your character model. In Unity, this means driving blend shapes on a 3D face mesh. In a web application, this could mean swapping 2D mouth sprites or adjusting SVG paths. In any system that supports keyframe animation, you can map viseme events to the appropriate mouth pose at each timestamp.
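For the 2D sprite-swapping case, the mapping can be as simple as a lookup table. Everything here (the viseme ids and the asset filenames) is hypothetical; substitute the ids your provider emits and your own art assets.

```python
# Hypothetical mapping from normalized viseme ids to 2D mouth sprites.
# Viseme ids and asset names are illustrative, not a fixed standard.
VISEME_TO_SPRITE = {
    "sil": "mouth_closed.png",  # silence: mouth closed
    "p":   "mouth_mbp.png",     # lips pressed together: B / P / M
    "ae":  "mouth_open.png",    # open mouth: A
    "o":   "mouth_round.png",   # round mouth: O
    "iy":  "mouth_wide.png",    # wide mouth, visible teeth: EE
}

def sprite_for(viseme_id: str) -> str:
    # Fall back to the closed mouth for any viseme id we don't recognize.
    return VISEME_TO_SPRITE.get(viseme_id, VISEME_TO_SPRITE["sil"])
```

In a 3D engine the same table would map viseme ids to blend shape indices instead of image files, but the dispatch logic is identical.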
Raw viseme data gives you the target mouth shape at each point in time. For smooth animation, your renderer needs to interpolate (tween) between consecutive shapes. A linear interpolation works for basic cases, but ease-in-out curves produce more natural-looking mouth movement. See How Lip Sync Animation Tween Data Works for detailed implementation guidance.
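A minimal sketch of the ease-in-out idea, using the common smoothstep curve, might look like this. The function names are illustrative; the point is only that a normalized time `t` is reshaped before blending.

```python
def ease_in_out(t: float) -> float:
    """Smoothstep easing: maps 0 -> 0 and 1 -> 1 with a gentle start and end."""
    return t * t * (3.0 - 2.0 * t)

def blend_weight(now_ms: float, start_ms: float, end_ms: float) -> float:
    """Blend weight toward the next viseme at playback time now_ms.

    Returns 0.0 at the start of the transition and 1.0 at the end,
    easing in and out rather than moving linearly.
    """
    if end_ms <= start_ms:
        return 1.0
    # Normalize elapsed time into [0, 1], clamping outside the transition window.
    t = min(max((now_ms - start_ms) / (end_ms - start_ms), 0.0), 1.0)
    return ease_in_out(t)
```

Each frame, your renderer would weight the outgoing viseme by `1 - blend_weight(...)` and the incoming one by `blend_weight(...)`, which is exactly the tweening described above.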
What the Viseme Data Looks Like
The response contains an array of objects, each representing one mouth position change. A simplified example looks like this: a timestamp of 0ms with viseme "sil" (silence, mouth closed), then 120ms with viseme "p" (lips together for the "P" sound), then 200ms with viseme "ae" (open mouth for the "A" sound), and so on. The exact format and viseme naming convention depends on the TTS provider being used, but the platform normalizes the data into a consistent structure you can rely on.
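The example sequence above can be queried at playback time with a binary search over the event timestamps: the active viseme is the latest one whose start time has passed. This is a sketch using the sample events from the text; the data layout is an assumption.

```python
import bisect

# The simplified example from the text, as (time_ms, viseme_id) pairs.
events = [(0, "sil"), (120, "p"), (200, "ae")]
times = [t for t, _ in events]  # precomputed, sorted start times

def active_viseme(now_ms: int) -> str:
    """Return the viseme whose start time is the latest one <= now_ms."""
    i = bisect.bisect_right(times, now_ms) - 1
    return events[max(i, 0)][1]
```

At 150 ms playback, for instance, the "p" viseme (which started at 120 ms) is still the active mouth shape.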
The density of viseme events varies with the speech content. Fast speech generates more events per second than slow speech. A typical sentence produces 20-40 viseme events. Pauses in speech generate silence visemes that tell your animation system to close the mouth and hold still.
Use Cases for Lip Sync Data
- Talking avatars: Virtual customer service agents, AI tutors, and brand mascots that speak to users with realistic mouth movement. See How to Build a Talking Avatar.
- Game characters: NPCs and dialogue characters with lip-synced speech, generated on the fly instead of pre-animated. See AI Voice for Game Characters.
- E-learning presenters: Animated instructors that deliver course content with natural mouth movement, making the learning experience more engaging.
- Interactive kiosks: Digital characters on kiosk screens that greet and assist visitors in retail, healthcare, and public spaces. See AI Voice for Kiosks.
- Video content: Automated video generation where an animated character presents scripted content with synchronized lip movement.
Why Integrated Lip Sync Matters
The traditional approach to lip sync involves three separate steps: generate the audio, run it through a lip sync analysis tool, then align the results and fix timing errors. Each step adds latency, cost, and potential for sync drift. The platform eliminates this by generating audio and lip sync data together, from the same source analysis, in a single API call.
This matters most for real-time applications. A voice chatbot with a talking avatar needs to generate and display speech within a second or two of the user's message. Running audio through a separate analysis pipeline after generation would double the latency and make the interaction feel sluggish. Integrated generation keeps the round trip fast enough for conversational use.