How to Integrate AI Voice With Your Application API
API Basics
The AI Apps API uses a standard REST interface. You send a POST request to the API endpoint with your account credentials (API key and account ID), the app name (voices), a command name, and a JSON payload containing the text to speak, voice selection, and any optional parameters. The API returns the generated audio as a downloadable file and, if requested, lip sync data as JSON.
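The request shape described above can be sketched in a few lines of Python. The endpoint URL and field names below are illustrative assumptions (the article does not specify them), but the pattern — credentials, app name, command, and a JSON payload in one POST body — is the same:

```python
import json
import urllib.request

# Hypothetical endpoint; substitute the real URL from your account dashboard.
API_URL = "https://api.example.com/ai-apps"

def build_request(api_key, account_id, app, command, payload):
    """Assemble the standard request body: credentials, app, command, payload."""
    return {
        "api_key": api_key,      # account credentials
        "account_id": account_id,
        "app": app,              # e.g. "voices"
        "command": command,      # e.g. a text-to-speech or transcription command
        "payload": payload,      # text, voice selection, optional parameters
    }

def call_api(body):
    """POST the JSON body and return the parsed JSON response."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Every call in the rest of this article follows this one pattern; only the command name and payload contents change.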
The same API handles text-to-speech (sending text, getting audio back), speech-to-text (sending audio, getting text back), and lip sync data generation. Each is a different command within the voices app, but they all follow the same request pattern.
Text-to-Speech API Call
To generate speech, send a request with the text content and voice parameters. Key fields include the text to speak, the voice provider (polly, elevenlabs, google), the specific voice ID, and whether you want lip sync data included. The response contains a URL to the generated audio file and, optionally, a JSON array of viseme events for lip sync animation.
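As a sketch, a TTS payload and response handler might look like this. The exact field names (`provider`, `voice_id`, `audio_url`, `visemes`) are assumptions for illustration:

```python
def tts_payload(text, provider="polly", voice_id="Joanna", want_lipsync=False):
    """Payload for a text-to-speech request (field names are illustrative)."""
    return {
        "text": text,
        "provider": provider,       # "polly", "elevenlabs", or "google"
        "voice_id": voice_id,       # provider-specific voice identifier
        "lipsync": want_lipsync,    # ask for viseme events alongside the audio
    }

def parse_tts_response(response):
    """Pull the audio file URL and optional viseme events out of the response."""
    return response.get("audio_url"), response.get("visemes", [])
```

If lip sync was not requested, the viseme list simply comes back empty and can be ignored.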
The audio file is typically MP3 format, suitable for playback in browsers, mobile apps, and most media players. For applications that need higher quality or uncompressed audio, WAV format is also available depending on the provider.
Speech-to-Text API Call
For transcription, upload an audio file with the transcription command. The API sends it to Whisper for processing and returns the transcribed text. Supported audio formats include MP3, WAV, WebM, M4A, and most other common formats. The API handles format conversion internally, so you can send whatever your recording device captures natively.
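One common way to upload audio in a JSON POST body is to base64-encode the file; the command name and field names below are assumptions, not the documented API:

```python
import base64

def stt_payload(audio_path):
    """Payload for a transcription request: base64-encode the recorded audio.

    The file can be MP3, WAV, WebM, M4A, etc. -- the API converts internally.
    """
    with open(audio_path, "rb") as f:
        audio_b64 = base64.b64encode(f.read()).decode("ascii")
    return {
        "command": "transcribe",   # hypothetical command name
        "audio": audio_b64,
    }
```

The response would then carry the transcribed text, ready to feed into your chatbot or application logic.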
Integration Patterns
Server-Side Integration
The most common pattern: your backend server makes API calls to the AI Apps API, receives the results, and serves them to your frontend. This keeps your API credentials on the server (never expose them in client-side code) and lets you add caching, logging, and rate limiting. The platform provides a PHP SDK, but any language that can make POST requests with JSON payloads works fine: PHP, Python, Node.js, Ruby, and so on.
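A framework-agnostic sketch of that proxy pattern: the handler runs on your backend, so the key never reaches the browser, and only the audio URL is forwarded to the client. The command and field names here are illustrative:

```python
# In practice, load this from an environment variable or secrets manager.
API_KEY = "server-side-secret"

def handle_speak_request(client_json, call_api):
    """Backend handler: validate client input, forward to the AI Apps API,
    and return only what the frontend needs (the audio URL)."""
    text = client_json.get("text", "").strip()
    if not text:
        return {"error": "text is required"}
    response = call_api({
        "api_key": API_KEY,              # stays on the server
        "app": "voices",
        "command": "text_to_speech",     # hypothetical command name
        "payload": {"text": text},
    })
    return {"audio_url": response.get("audio_url")}
```

The `call_api` parameter is just the HTTP call; injecting it also makes the handler easy to test without touching the network.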
Chatbot Voice Integration
For voice-enabled chatbots, the integration adds two API calls to the normal chatbot flow. Before the chatbot call: send user audio to the speech-to-text endpoint to get text input. After the chatbot call: send the chatbot's text response to the text-to-speech endpoint to get audio output. Your existing chatbot logic in between stays unchanged. See How to Add AI Voice to Your Chatbot.
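The whole voice turn reduces to a three-step pipeline. In this sketch the three callables stand in for the two voice API calls and your existing chatbot logic, which stays untouched in the middle:

```python
def voice_chat_turn(user_audio, transcribe, chatbot, synthesize):
    """One voice-enabled chatbot turn: STT -> chatbot -> TTS."""
    user_text = transcribe(user_audio)     # speech-to-text API call
    reply_text = chatbot(user_text)        # your existing chatbot, unchanged
    reply_audio = synthesize(reply_text)   # text-to-speech API call
    return reply_text, reply_audio
```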
Unity Game Integration
For Unity games, use UnityWebRequest to call the API from within your game. Request TTS with lip sync data, then apply the returned viseme events to your character's facial blend shapes. Audio playback uses Unity's AudioSource component. The API call happens asynchronously via coroutines, so the game does not freeze while waiting for the response. See AI Voice for Game Characters.
Mobile App Integration
Mobile apps call the API from their backend server (not directly from the device to avoid exposing credentials). The backend generates audio and returns it to the app for playback. For speech-to-text, the app captures audio on-device and uploads it through the backend to the transcription endpoint. Both iOS and Android have native audio playback and recording APIs that work seamlessly with the generated audio files.
Caching Generated Audio
If the same text is spoken repeatedly (greetings, menu options, common responses), cache the generated audio rather than regenerating it on every request. This saves API credits and reduces latency to near zero for cached content. Common caching strategies include storing audio files on disk or in a CDN, keyed by a hash of the text and voice parameters. Only regenerate when the text or voice selection changes.
Error Handling
API calls can fail due to network issues, invalid parameters, or service outages. Implement retry logic with exponential backoff for transient failures. For voice-enabled chatbots, have a text-only fallback: if TTS fails, display the chatbot response as text instead of leaving the user with silence. For speech-to-text failures, show a "could not understand, please try again" message and let the user type instead.
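Both ideas — exponential backoff for transient failures and a text-only fallback when TTS stays down — fit in a few lines:

```python
import time

def with_retries(call, attempts=3, base_delay=0.5):
    """Retry transient failures with exponential backoff (0.5s, 1s, 2s, ...)."""
    for i in range(attempts):
        try:
            return call()
        except Exception:
            if i == attempts - 1:
                raise          # out of retries; let the caller decide
            time.sleep(base_delay * (2 ** i))

def respond(chatbot_text, tts, attempts=3, base_delay=0.5):
    """Text-only fallback: if TTS fails after retries, still show the text."""
    try:
        audio = with_retries(lambda: tts(chatbot_text), attempts, base_delay)
        return {"text": chatbot_text, "audio": audio}
    except Exception:
        return {"text": chatbot_text, "audio": None}  # never leave silence
```

The speech-to-text side mirrors this: on repeated failure, surface a "could not understand, please try again" prompt and fall back to typed input.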
Integrate AI voice into any application with simple API calls. Text-to-speech, speech-to-text, and lip sync in one platform.
Get Started Free