How to Add Voice Input and Output to Your Chatbot
What Voice Features Are Available
The platform supports two voice capabilities that you can use separately or together. Voice input uses OpenAI Whisper to transcribe spoken questions into text, so visitors can talk to your chatbot instead of typing. Voice output uses Amazon Polly to synthesize the chatbot's text response into spoken audio, delivered as MP3 or OGG. You can enable one or both depending on your use case.
Amazon Polly provides over 100 neural and standard voices across 30+ languages. Neural voices sound more natural and support lip-sync timing data (called tween data), which is useful if you are building an animated avatar chatbot. Standard voices cost less and work well for straightforward audio responses.
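If you want to narrow the voice catalog down yourself, the sketch below filters a list of voices by engine and language. It assumes the entries are shaped like the "Voices" list returned by Polly's DescribeVoices operation (fields Id, LanguageCode, SupportedEngines). The sample entries and their names are made up for illustration and are not a real catalog.

```python
from typing import Any


def voices_for(voices: list[dict[str, Any]], engine: str, language_prefix: str) -> list[str]:
    """Return the sorted IDs of voices that support `engine` and match a language prefix."""
    return sorted(
        voice["Id"]
        for voice in voices
        if engine in voice.get("SupportedEngines", [])
        and voice.get("LanguageCode", "").startswith(language_prefix)
    )


# Illustrative entries only: the IDs and engine support below are invented.
# With AWS credentials configured, a live list could come from
# boto3.client("polly").describe_voices()["Voices"].
SAMPLE_VOICES = [
    {"Id": "ExampleAnna", "LanguageCode": "en-US", "SupportedEngines": ["neural", "standard"]},
    {"Id": "ExampleBen", "LanguageCode": "en-GB", "SupportedEngines": ["standard"]},
    {"Id": "ExampleCarla", "LanguageCode": "es-ES", "SupportedEngines": ["neural"]},
]

english_neural = voices_for(SAMPLE_VOICES, engine="neural", language_prefix="en")
```

Filtering on a language prefix such as "en" keeps regional variants (en-US, en-GB) together, which is handy when you are comparing candidates for one chatbot.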
Before You Start
You need an existing chatbot configured in the AI Chatbot app. Voice features are added on top of a working text chatbot, so make sure your chatbot is responding correctly to typed messages first. You also need the Voices app installed on your account.
Step-by-Step Setup
Open the Voices app in your admin panel and create a new character. Choose a voice from the Amazon Polly library, pick an engine (neural is recommended for the most natural sound), set the output format to MP3 or OGG, and select the language. This character profile stores all the voice settings your chatbot will use.
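To make the character profile concrete, here is a minimal sketch of the same four settings as a data structure, plus the request parameters they would map to on Polly's SynthesizeSpeech operation. The VoiceCharacter class and its field names are hypothetical and are not the Voices app's own schema. The parameter names (Text, VoiceId, Engine, OutputFormat, LanguageCode) and the "ogg_vorbis" format value are Polly's.

```python
from dataclasses import dataclass
from typing import Any

# Engines covered in this guide (Polly offers others as well).
_ENGINES = {"standard", "neural"}
# Map the format names used in this guide to Polly's OutputFormat values.
_FORMATS = {"mp3": "mp3", "ogg": "ogg_vorbis"}


@dataclass(frozen=True)
class VoiceCharacter:
    """Hypothetical stand-in for a character profile in the Voices app."""

    name: str
    voice_id: str
    engine: str = "neural"
    output_format: str = "mp3"
    language_code: str = "en-US"

    def __post_init__(self) -> None:
        if self.engine not in _ENGINES:
            raise ValueError(f"unsupported engine: {self.engine!r}")
        if self.output_format not in _FORMATS:
            raise ValueError(f"unsupported output format: {self.output_format!r}")

    def synthesis_params(self, text: str) -> dict[str, Any]:
        """Build keyword arguments for a Polly SynthesizeSpeech request."""
        return {
            "Text": text,
            "VoiceId": self.voice_id,
            "Engine": self.engine,
            "OutputFormat": _FORMATS[self.output_format],
            "LanguageCode": self.language_code,
        }


character = VoiceCharacter(name="Support agent", voice_id="Joanna", output_format="ogg")
params = character.synthesis_params("Thanks for calling. How can I help?")
# With AWS credentials configured, the audio bytes could be fetched with:
#   boto3.client("polly").synthesize_speech(**params)["AudioStream"].read()
```

Keeping the settings in one immutable profile mirrors how the chatbot reuses a single character for every response.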
Open your chatbot settings in the AI Chatbot app. Find the voice character field and select the character you just created. This tells the chatbot to synthesize every response into audio using that voice profile. Save your chatbot settings.
In your chatbot embed settings, enable the audio input option. This adds a microphone button to the chat widget. When a visitor clicks it and speaks, their audio is sent to OpenAI Whisper for transcription, and the resulting text is processed by the chatbot like any typed message.
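The platform handles this flow for you, but the sketch below shows roughly what happens to a voice message. It assumes an OpenAI Python SDK client, whose audio.transcriptions.create call accepts a model name and an open audio file. The handle_voice_message function and the reply_fn callback are hypothetical names standing in for the chatbot's own text pipeline.

```python
from typing import Any, BinaryIO, Callable


def transcribe(client: Any, audio: BinaryIO, model: str = "whisper-1") -> str:
    """Send recorded audio to Whisper and return the transcribed text."""
    result = client.audio.transcriptions.create(model=model, file=audio)
    return result.text.strip()


def handle_voice_message(
    client: Any, audio: BinaryIO, reply_fn: Callable[[str], str]
) -> dict[str, str]:
    """Transcribe the audio, then treat the text like any typed message."""
    transcript = transcribe(client, audio)
    if not transcript:
        # Nothing intelligible was recorded, so skip the chatbot call.
        return {"transcript": "", "reply": ""}
    return {"transcript": transcript, "reply": reply_fn(transcript)}


# Example wiring (requires the openai package and an API key):
#   from openai import OpenAI
#   with open("question.webm", "rb") as f:
#       print(handle_voice_message(OpenAI(), f, my_chatbot_reply))
```

Passing the client in as an argument keeps the function easy to exercise with a stub, without making a network call.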
Open your chatbot on your website and click the microphone button to ask a question by voice. The chatbot should return both a text response and an audio player with the spoken version. Check that the voice sounds right and the transcription is accurate.
If you are building an animated character, enable the tween option on your voice character. This generates viseme timing marks alongside the audio, which your frontend code can use to animate mouth movements in sync with the speech. Tween data requires the neural engine, and it doubles the synthesis cost because the timing marks are produced in a second processing pass.
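As a rough illustration of how timing marks drive an animation, the sketch below parses viseme marks and looks up the mouth shape for a given playback time. It assumes the tween data arrives in the format Polly uses for speech marks: one JSON object per line with time (milliseconds from the start of the audio), type, and value fields. Whether the platform passes that format through unchanged is an assumption, and the sample timings are invented.

```python
import bisect
import json


def parse_visemes(speech_marks: str) -> list[tuple[int, str]]:
    """Extract (time_ms, viseme) pairs from line-delimited JSON speech marks."""
    marks = []
    for line in speech_marks.splitlines():
        line = line.strip()
        if not line:
            continue
        mark = json.loads(line)
        if mark.get("type") == "viseme":
            marks.append((int(mark["time"]), str(mark["value"])))
    marks.sort(key=lambda m: m[0])
    return marks


def viseme_at(marks: list[tuple[int, str]], t_ms: int, rest: str = "sil") -> str:
    """Return the viseme active at playback time t_ms ("sil" means silence)."""
    times = [t for t, _ in marks]
    index = bisect.bisect_right(times, t_ms) - 1
    return marks[index][1] if index >= 0 else rest


# Invented sample: a word mark (ignored) followed by three viseme marks.
SAMPLE_MARKS = "\n".join([
    '{"time": 0, "type": "word", "start": 0, "end": 2, "value": "pa"}',
    '{"time": 0, "type": "viseme", "value": "p"}',
    '{"time": 120, "type": "viseme", "value": "a"}',
    '{"time": 300, "type": "viseme", "value": "sil"}',
])

marks = parse_visemes(SAMPLE_MARKS)
current_shape = viseme_at(marks, 150)
```

An animation loop would call viseme_at on every frame with the audio player's current position and swap the avatar's mouth shape whenever the result changes.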
Choosing the Right Voice
For customer-facing chatbots, neural voices are worth the small cost increase because they sound significantly more natural. Standard voices work well for internal tools or high-volume applications where cost matters more than polish. You can preview voices in the Voices app before assigning one to your chatbot.
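To compare engines for your own traffic, a back-of-the-envelope estimate can help. The sketch below assumes synthesis is billed per character at a rate per million characters, and that tween data doubles the cost as described in the setup steps. The rates are inputs you supply from your own plan; the numbers in the usage lines are placeholders, not real prices.

```python
def estimate_synthesis_cost(
    characters: int, rate_per_million: float, tween: bool = False
) -> float:
    """Estimate synthesis cost for a volume of response text.

    rate_per_million is whatever your plan charges per one million
    characters; tween data is assumed to add a second pass at the same rate.
    """
    if characters < 0 or rate_per_million < 0:
        raise ValueError("characters and rate_per_million must be non-negative")
    passes = 2 if tween else 1
    return characters / 1_000_000 * rate_per_million * passes


# Placeholder rates for illustration only; substitute your actual pricing.
monthly_characters = 500_000
standard_cost = estimate_synthesis_cost(monthly_characters, rate_per_million=1.0)
neural_with_tween = estimate_synthesis_cost(monthly_characters, rate_per_million=4.0, tween=True)
```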
If your audience speaks multiple languages, create a separate voice character for each language and assign each one to a language-specific chatbot. Amazon Polly supports English, Spanish, French, German, Japanese, Portuguese, Italian, and many more languages.
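If your site decides which chatbot to load based on the visitor's language, the lookup can be as simple as the sketch below. The mapping, the character names, and the pick_character helper are all hypothetical. The function tries the full locale first, then the base language, then a default.

```python
def pick_character(characters: dict[str, str], locale: str, default: str) -> str:
    """Choose a voice character for a locale such as "es-MX" or "pt_BR"."""
    normalized = {key.replace("_", "-").lower(): value for key, value in characters.items()}
    tag = locale.replace("_", "-").lower()
    # Try the exact locale first, then fall back to the base language.
    for candidate in (tag, tag.split("-")[0]):
        if candidate in normalized:
            return normalized[candidate]
    return default


# Hypothetical mapping from language tags to voice character names.
CHARACTERS = {
    "en": "support-en",
    "es": "support-es",
    "pt-BR": "support-pt-br",
}

chosen = pick_character(CHARACTERS, "es-MX", default="support-en")
```

Falling back to the base language means a visitor with an es-MX browser still gets the Spanish character even though only "es" is configured.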
Common Use Cases
- Accessibility: Voice input and output make your chatbot usable for visitors who have difficulty typing or reading
- Hands-free support: Warehouse workers, drivers, and field staff can interact with your chatbot by voice while working
- Animated avatars: Combine voice output with tween data to create a talking character that greets website visitors
- Phone-like experience: Give visitors a conversational, voice-first interaction that feels more personal than text chat
Add a voice to your AI chatbot today. Natural-sounding speech in 30+ languages.