
How to Generate Lip Sync Animation Data With TTS

When you generate speech through the AI Apps API voice system, you can request lip sync animation data alongside the audio. The API returns timed viseme data that maps mouth positions to the audio timeline, giving you everything you need to animate a character's lips in sync with the spoken words in a single API call.

What Lip Sync Data Is

Lip sync data is a sequence of timed markers that describe the shape a character's mouth should take at each moment during speech. Each marker contains a timestamp (when in the audio this mouth shape occurs), a viseme identifier (which mouth shape to display), and duration or blend weight information for smooth transitions between shapes.

Visemes are the visual equivalent of phonemes. While phonemes are the distinct sounds in speech, visemes are the distinct mouth shapes that correspond to those sounds. English speech uses roughly 15-20 viseme categories. For example, the "B" and "P" sounds share the same viseme (lips pressed together), while "O" has a distinct round mouth shape, and "EE" shows a wide mouth with visible teeth.

The platform generates this data by analyzing the phonetic content of the TTS output and mapping it to the standard viseme set. Because the lip sync data is generated from the same process as the audio, the timing alignment is precise. There is no drift or sync error that you would get from analyzing audio separately with a third-party lip sync tool.

How to Request Lip Sync Data

Step 1: Send a TTS request with lip sync enabled.
In your API call to the voices app, include the parameter that requests lip sync data along with the audio. The exact parameter name depends on your integration method, but the concept is the same: you tell the API you want animation data returned alongside the audio file.
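As a rough sketch, a request body with lip sync enabled might look like the following. The field names here (`text`, `voice`, `include_lip_sync`) are illustrative placeholders, not documented parameter names; check your integration's reference for the actual parameter:

```python
# Hypothetical TTS request payload -- field names are illustrative only.
tts_request = {
    "text": "Hello there!",
    "voice": "narrator",        # hypothetical voice identifier
    "include_lip_sync": True,   # ask for viseme data alongside the audio
}
```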
Step 2: Receive the combined response.
The API returns both the audio file (typically in MP3 or WAV format) and a JSON array of viseme events. Each event contains a time offset in milliseconds, a viseme type identifier, and a value or weight for blending. The audio and viseme data are synchronized so you can play both together.
Step 3: Feed the data to your animation system.
Your rendering engine reads the viseme events and applies them to your character model. In Unity, this means driving blend shapes on a 3D face mesh. In a web application, this could mean swapping 2D mouth sprites or adjusting SVG paths. In any system that supports keyframe animation, you can map viseme events to the appropriate mouth pose at each timestamp.
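A minimal sketch of the lookup your renderer performs each frame: given the current audio playback time, find the most recent viseme event and display that mouth shape. The event values below mirror the simplified example later in this article; the tuple layout is an assumption for illustration:

```python
import bisect

# Hypothetical viseme events as (time_ms, viseme_id) pairs, sorted by time.
events = [(0, "sil"), (120, "p"), (200, "ae"), (340, "sil")]
times = [t for t, _ in events]  # parallel list of timestamps for binary search

def viseme_at(playback_ms):
    """Return the viseme that should be on screen at the given playback time."""
    i = bisect.bisect_right(times, playback_ms) - 1  # last event at or before now
    return events[max(i, 0)][1]
```

Called once per rendered frame with the audio clock as input, this keeps the mouth shape locked to the audio regardless of frame rate.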
Step 4: Apply tweening for smooth transitions.
Raw viseme data gives you the target mouth shape at each point in time. For smooth animation, your renderer needs to interpolate (tween) between consecutive shapes. A linear interpolation works for basic cases, but ease-in-out curves produce more natural-looking mouth movement. See How Lip Sync Animation Tween Data Works for detailed implementation guidance.
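The interpolation step above can be sketched as follows. The smoothstep curve is one common ease-in-out choice, not a function the platform prescribes; `blend_weight` is a hypothetical helper that your renderer would feed into its blend shapes:

```python
def ease_in_out(u):
    """Cubic smoothstep: slow at the start and end, faster in the middle."""
    return 3 * u * u - 2 * u * u * u

def blend_weight(t_ms, start_ms, end_ms):
    """Weight of the upcoming viseme while transitioning away from the previous one.

    Returns 0.0 at the transition start, 1.0 at the end, eased in between.
    """
    u = (t_ms - start_ms) / (end_ms - start_ms)
    u = min(max(u, 0.0), 1.0)  # clamp so out-of-range times stay valid
    return ease_in_out(u)
```

Applying this weight to the incoming viseme (and its complement to the outgoing one) avoids the popping you get when mouth shapes snap instantly from one pose to the next.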

What the Viseme Data Looks Like

The response contains an array of objects, each representing one mouth position change. A simplified example looks like this: a timestamp of 0ms with viseme "sil" (silence, mouth closed), then 120ms with viseme "p" (lips together for the "P" sound), then 200ms with viseme "ae" (open mouth for the "A" sound), and so on. The exact format and viseme naming convention depends on the TTS provider being used, but the platform normalizes the data into a consistent structure you can rely on.
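The simplified example above can be written out as a data structure. The key names (`time_ms`, `viseme`) are assumptions for illustration; the actual keys follow the platform's normalized structure:

```python
# The simplified example from the text as an event list, sorted by time.
# Key names are illustrative, not the documented response schema.
viseme_events = [
    {"time_ms": 0,   "viseme": "sil"},  # silence, mouth closed
    {"time_ms": 120, "viseme": "p"},    # lips together for the "P" sound
    {"time_ms": 200, "viseme": "ae"},   # open mouth for the "A" sound
]
```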

The density of viseme events varies with the speech content. Fast speech generates more events per second than slow speech. A typical sentence produces 20-40 viseme events. Pauses in speech generate silence visemes that tell your animation system to close the mouth and hold still.

Why Integrated Lip Sync Matters

The traditional approach to lip sync involves three separate steps: generate the audio, run it through a lip sync analysis tool, then align the results and fix timing errors. Each step adds latency, cost, and potential for sync drift. The platform eliminates this by generating audio and lip sync data together, from the same source analysis, in a single API call.

This matters most for real-time applications. A voice chatbot with a talking avatar needs to generate and display speech within a second or two of the user's message. Running audio through a separate analysis pipeline after generation would double the latency and make the interaction feel sluggish. Integrated generation keeps the round trip fast enough for conversational use.

Performance note: Lip sync data adds minimal overhead to the TTS request. The audio generation is the slow part, and the viseme mapping happens during that same process. Requesting lip sync data typically adds less than 100ms to the total response time.

Generate speech with synchronized lip sync animation data in a single API call. Build talking avatars, game characters, and interactive experiences.

Get Started Free