How Lip Sync Animation Tween Data Works
Understanding the Data Structure
When you request lip sync data alongside a TTS response, the API returns an array of viseme events. Each event has three key properties: time (when this mouth shape should be active, in milliseconds from the start of the audio), type (which viseme or mouth shape to display), and value (the intensity or blend weight, typically 0.0 to 1.0).
The events are sorted chronologically. Your animation system reads through them in order, advancing to each new event as the audio playback reaches its timestamp. Between events, you interpolate (tween) from the current mouth shape to the next one, which creates the smooth transitions that make lip sync look natural rather than choppy.
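As a sketch, the event shape and the "find where we are between two events" step described above might look like this in TypeScript. The property names (`time`, `type`, `value`) follow the text; the `VisemeType` union and the helper name `framePosition` are illustrative, not part of the API.

```typescript
// Illustrative subset of viseme types; the real set is larger.
type VisemeType = "sil" | "PP" | "FF" | "aa" | "E" | "oh";

interface VisemeEvent {
  time: number;      // milliseconds from the start of the audio
  type: VisemeType;  // which mouth shape to display
  value: number;     // blend weight, typically 0.0 to 1.0
}

// Given the chronologically sorted event array and the current audio
// time, find the surrounding pair of events and the linear progress
// (0..1) between them, ready to feed into a tween.
function framePosition(events: VisemeEvent[], audioMs: number) {
  let i = 0;
  while (i + 1 < events.length && events[i + 1].time <= audioMs) i++;
  const current = events[i];
  const next = events[Math.min(i + 1, events.length - 1)];
  const span = next.time - current.time;
  const t = span > 0 ? (audioMs - current.time) / span : 1;
  return { current, next, t: Math.min(Math.max(t, 0), 1) };
}
```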
Viseme Types
Visemes are the visual mouth shapes that correspond to groups of speech sounds. Different sounds that look the same when spoken share the same viseme. For example, the sounds "B," "P," and "M" all use the same viseme (lips pressed together) because you cannot see the difference between them by watching someone's mouth.
The standard viseme set used by the platform includes approximately 15-20 shapes, covering all the distinct mouth positions needed for natural-looking English speech. Common visemes include:
- sil (silence): Mouth closed, neutral position. Used during pauses and at the beginning and end of speech.
- PP (bilabial): Lips pressed together for B, P, M sounds.
- FF (labiodental): Lower lip touching upper teeth for F, V sounds.
- TH (dental): Tongue between teeth for TH sounds.
- DD (alveolar): Tongue touching the ridge behind upper teeth for D, T, N, L sounds.
- kk (velar): Back of tongue raised for K, G sounds.
- CH (post-alveolar): Tongue forward and lips slightly rounded for CH, SH, J sounds.
- SS (sibilant): Teeth close together for S, Z sounds.
- RR (retroflex): Tongue curled back for R sounds.
- aa (open): Mouth wide open for the "AH" vowel sound.
- E (mid front): Mouth moderately open for "EH" sounds.
- ih (near close): Small mouth opening for "IH" sounds.
- oh (mid back rounded): Lips rounded for "OH" sounds.
- ou (close back): Lips in a small circle for "OO" sounds.
Implementing Tweening
Linear Interpolation
The simplest approach: calculate a blend factor between 0 and 1 based on how far you are between two consecutive viseme events, and mix the two mouth shapes proportionally. If the current event says "open mouth" at 100ms and the next says "lips together" at 200ms, then at 150ms you show a 50/50 blend of both shapes. This produces functional lip sync but can look slightly robotic because real mouth movements do not follow perfectly linear timing.
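A minimal sketch of that linear blend, using the 100 ms/200 ms example from the text (the function name is ours, not the platform's):

```typescript
// Linear tween between two mouth shapes: t is the linear progress
// between consecutive viseme events (0 at the current event, 1 at the
// next). Returns the proportional weights of the outgoing and incoming
// shapes, which always sum to 1.
function linearBlend(t: number): { outgoing: number; incoming: number } {
  return { outgoing: 1 - t, incoming: t };
}

// Halfway between events at 100 ms and 200 ms, t = 0.5:
// linearBlend(0.5) → { outgoing: 0.5, incoming: 0.5 }
```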
Ease-In-Out Interpolation
A better approach: apply an ease-in-out curve to the blend factor so the transition starts slowly, speeds up in the middle, and slows down at the end. This mimics how real muscles accelerate and decelerate, producing more natural-looking movement. A simple smoothstep function works well: `blend = t * t * (3 - 2 * t)`, where t is the linear progress from 0 to 1.
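The smoothstep formula above, as a drop-in replacement for the linear blend factor:

```typescript
// Smoothstep easing: slow at the edges, fastest in the middle.
// Takes the linear progress t between two viseme events and returns
// the eased blend factor to use instead of t.
function smoothstep(t: number): number {
  const c = Math.min(Math.max(t, 0), 1); // clamp to [0, 1]
  return c * c * (3 - 2 * c);
}

// smoothstep(0.5) === 0.5, but smoothstep(0.25) === 0.15625, so early
// progress maps to a smaller blend than linear — the slow start.
```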
Blend Shape Implementation (3D)
In Unity or similar 3D engines, each viseme maps to a blend shape (morph target) on the character's face mesh. At each animation frame, read the current viseme event and set the corresponding blend shape weight. For tweening, gradually reduce the weight of the previous viseme's blend shape while increasing the weight of the current one. Unity's SkinnedMeshRenderer.SetBlendShapeWeight handles this directly.
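An engine-agnostic sketch of the per-frame weight calculation. The returned map would be applied with your engine's API (in Unity, via SkinnedMeshRenderer.SetBlendShapeWeight); the function name and the idea of returning a name-to-weight map are our assumptions:

```typescript
// Compute blend shape weights for one animation frame: fade out the
// previous viseme's shape while fading in the current one. t is the
// (ideally eased) progress from the previous event to the current one.
function blendShapeWeights(
  prevViseme: string,
  currViseme: string,
  t: number
): Record<string, number> {
  const weights: Record<string, number> = {};
  weights[prevViseme] = (weights[prevViseme] ?? 0) + (1 - t);
  weights[currViseme] = (weights[currViseme] ?? 0) + t;
  return weights; // every blend shape not listed is implicitly 0
}
```

Note that when the previous and current visemes are the same shape (e.g. two consecutive "aa" events), the two contributions sum to a full weight of 1 rather than fighting each other.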
Sprite Swap Implementation (2D)
For 2D characters, create a sprite for each viseme category (or at least the major ones: closed, slightly open, wide open, rounded, lips together). At each viseme event, swap to the corresponding sprite. For smoother transitions, use cross-fading between sprites with a short transition time. Even without smooth transitions, snapping between sprites at the right timing looks convincing for cartoon-style characters.
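A sketch of the sprite-swap-with-cross-fade approach. The category map below is an assumption for illustration, not the platform's official grouping; adjust it to your own sprite set:

```typescript
// Hypothetical viseme-to-sprite mapping for a reduced 2D sprite set.
const SPRITE_FOR_VISEME: Record<string, string> = {
  sil: "closed",
  PP: "lips_together", FF: "lips_together",
  aa: "wide_open",
  E: "slightly_open", ih: "slightly_open",
  oh: "rounded", ou: "rounded",
};

// Returns the outgoing and incoming sprites plus cross-fade opacities
// for a short transition window (fadeMs) after each viseme event.
// Past the window, the incoming sprite is fully opaque.
function spriteFrame(prev: string, curr: string, msSinceEvent: number, fadeMs = 40) {
  const from = SPRITE_FOR_VISEME[prev] ?? "closed";
  const to = SPRITE_FOR_VISEME[curr] ?? "closed";
  const fade = Math.min(Math.max(msSinceEvent / fadeMs, 0), 1);
  return { from, to, fromOpacity: 1 - fade, toOpacity: fade };
}
```

Setting `fadeMs` to 0 degenerates to the hard snap described above, which is often fine for cartoon-style characters.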
Timing Synchronization
The most critical aspect of lip sync is keeping the animation in sync with the audio playback. If the mouth moves ahead of or behind the sound, the illusion breaks immediately. Always use the audio playback position as your timing reference, not wall-clock time. In a web browser, use the Audio element's currentTime property. In Unity, use AudioSource.time. This ensures that if the audio stutters, buffers, or pauses, the animation follows.
Pre-buffer the entire viseme array and use binary search or a pointer to find the current event based on audio time, rather than processing events in real time. This avoids accumulating timing errors over long speech segments.
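The binary search lookup might look like this (the event shape follows the text; the function name is ours):

```typescript
// Binary search over the pre-buffered, chronologically sorted viseme
// array. Returns the index of the last event whose time <= audioMs,
// or -1 if playback has not yet reached the first event. Because the
// lookup is driven entirely by the audio clock, it cannot drift over
// long speech segments.
function findActiveEvent(events: { time: number }[], audioMs: number): number {
  let lo = 0;
  let hi = events.length - 1;
  let ans = -1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    if (events[mid].time <= audioMs) {
      ans = mid;      // candidate; look for a later one
      lo = mid + 1;
    } else {
      hi = mid - 1;
    }
  }
  return ans;
}
```

In a browser you would call this each animation frame with `audioElement.currentTime * 1000`; in Unity, with `AudioSource.time * 1000`.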
Optimizing for Performance
Lip sync animation is lightweight computationally. The viseme data for a typical sentence is a few kilobytes at most. Processing it at 60fps adds negligible CPU load. The main performance consideration is blend shape evaluation on 3D characters, which can be expensive if the face mesh has many vertices. For mobile and web targets, keep face mesh complexity reasonable (under 10,000 vertices) and limit the number of active blend shapes to what you actually need for lip sync (typically 5-8 shapes are enough for convincing results).
Get lip sync animation data with every TTS request. Build talking characters with precise, synchronized mouth movement.