Voice & Speech
OpenACP supports two speech features: speech-to-text (STT) for voice input and text-to-speech (TTS) for spoken responses. Both are optional and configured independently.
Speech-to-text (STT)
Provider: Groq (uses the Whisper large v3 turbo model)
Cost: Free tier available at console.groq.com, with 28,800 seconds of audio per day
When STT is configured, you can send voice messages to a session topic and OpenACP transcribes them before passing the text to the agent. The transcribed text appears in the topic as a system message:
You said: "Add a unit test for the login function"
The agent then receives the transcription as a normal text prompt. If the agent natively supports audio input, the audio attachment is passed directly instead.
Supported audio formats: OGG, WAV, MP3, M4A, WebM, FLAC (maximum 25 MB per file).
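The format and size limits above can be checked before any upload is attempted. The following is a minimal sketch, not OpenACP's actual validation: the function name, the mapping from format names to file extensions, and the reading of 25 MB as 25 × 1024 × 1024 bytes are all assumptions.

```python
from pathlib import Path

# Extensions inferred from the format list above; the exact set that
# OpenACP accepts is an assumption.
SUPPORTED_EXTENSIONS = {".ogg", ".wav", ".mp3", ".m4a", ".webm", ".flac"}

# Assumes 1 MB = 1024 * 1024 bytes; the docs do not say which definition applies.
MAX_AUDIO_BYTES = 25 * 1024 * 1024


def is_acceptable_audio(filename: str, size_bytes: int) -> bool:
    """Return True if the file has a supported extension and fits the size cap."""
    extension = Path(filename).suffix.lower()
    return extension in SUPPORTED_EXTENSIONS and 0 < size_bytes <= MAX_AUDIO_BYTES
```

A check like this only looks at the file name and size; it does not inspect the audio container itself.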
Configuring STT
Add your Groq API key to the config (see Configuration for the full speech config reference):
{
  "speech": {
    "stt": {
      "provider": "groq",
      "providers": {
        "groq": {
          "apiKey": "gsk_..."
        }
      }
    }
  }
}
Or use /settings in Telegram: tap the STT provider field and the assistant will walk you through entering an API key.
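To make the nesting of the STT config concrete, here is a small sketch that reads the active provider's key from a config document shaped like the one above. The helper name `stt_api_key` is hypothetical, and this is not how OpenACP itself loads its config; it only illustrates that `provider` selects an entry under `providers`.

```python
import json
from typing import Optional


def stt_api_key(config_text: str) -> Optional[str]:
    """Return the API key of the selected STT provider, or None if unset."""
    config = json.loads(config_text)
    stt = config.get("speech", {}).get("stt", {})
    provider = stt.get("provider")
    if provider is None:
        return None  # no STT provider selected
    # "provider" names the entry to look up under "providers".
    return stt.get("providers", {}).get(provider, {}).get("apiKey")
```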
STT error handling
If transcription fails (network issue, rate limit, invalid key), the audio attachment is kept and passed to the agent as-is, with an error message in the topic. The Groq free tier limit is 28,800 seconds per day; if exceeded, transcription fails gracefully.
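The fallback described above can be sketched as follows. This is an illustration under stated assumptions, not OpenACP's implementation: `PromptInput`, `prepare_voice_prompt`, and the `transcribe` callable are hypothetical names, and the wording of the error notice is invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class PromptInput:
    text: Optional[str]     # transcription, present when STT succeeded
    audio: Optional[bytes]  # original attachment, kept when STT failed
    notice: str             # message posted to the session topic


def prepare_voice_prompt(
    audio: bytes, transcribe: Callable[[bytes], str]
) -> PromptInput:
    """Transcribe the audio, or fall back to passing the attachment as-is."""
    try:
        text = transcribe(audio)
    except Exception as error:  # network issue, rate limit, invalid key, ...
        # Keep the audio so the agent still receives the user's message.
        return PromptInput(text=None, audio=audio,
                           notice=f"Transcription failed: {error}")
    return PromptInput(text=text, audio=None, notice=f'You said: "{text}"')
```

The point of the design is that a failed transcription never drops the user's message: the agent always receives either the text or the original audio.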
Text-to-speech (TTS)
Provider: Edge TTS (Microsoft's neural TTS service)
Cost: Free, no API key required
Default voice: en-US-AriaNeural
Output format: MP3 (24 kHz, 48 kbps mono)
When TTS is active for a session, the agent is instructed to include a spoken-friendly summary of its response in a [TTS]...[/TTS] block. OpenACP extracts this block, synthesizes audio, and sends it back to the chat as a voice message. TTS synthesis has a 30-second timeout; if synthesis takes longer, the audio is skipped silently.
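Extracting the [TTS]...[/TTS] block can be sketched with a regular expression. This is a minimal illustration, not OpenACP's parser: the function name `split_tts` is hypothetical, and removing the block from the visible text is an assumption about how the remaining response is shown.

```python
import re
from typing import Optional, Tuple

# Non-greedy match so only the first [TTS]...[/TTS] pair is captured;
# DOTALL lets the summary span multiple lines.
TTS_BLOCK = re.compile(r"\[TTS\](.*?)\[/TTS\]", re.DOTALL)


def split_tts(response: str) -> Tuple[str, Optional[str]]:
    """Return (visible text, spoken summary or None) for an agent response."""
    match = TTS_BLOCK.search(response)
    if match is None:
        return response, None
    spoken = match.group(1).strip()
    # Assumption: the block is removed from the text shown in the chat.
    visible = (response[:match.start()] + response[match.end():]).strip()
    return visible, spoken or None
```

For example, `split_tts("Done.\n[TTS]Tests pass.[/TTS]")` returns `("Done.", "Tests pass.")`, and a response without a block is returned unchanged with `None` as the summary.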
The agent decides what to include in the TTS block. It focuses on key information, decisions the user needs to make, or required actions. The response language matches whatever language you are using.
Voice modes
TTS operates in one of three modes per session:
off: No TTS (default)
next: TTS for the next message only, then reverts to off
on: TTS for every subsequent message
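The three modes behave like a small state machine. The sketch below shows the transition that happens each time a response is sent; `VoiceMode` and `consume` are hypothetical names chosen for illustration, not identifiers from OpenACP.

```python
from enum import Enum
from typing import Tuple


class VoiceMode(Enum):
    OFF = "off"
    NEXT = "next"
    ON = "on"


def consume(mode: VoiceMode) -> Tuple[bool, VoiceMode]:
    """Return (speak this message?, mode to use for the following message)."""
    if mode is VoiceMode.NEXT:
        # One-shot: speak once, then revert to off.
        return True, VoiceMode.OFF
    # OFF and ON are stable: the mode does not change after a message.
    return mode is VoiceMode.ON, mode
```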
Toggling TTS
Telegram — in a session topic:
Via the session control keyboard: Tap the "Text to Speech" toggle button in the session setup message.
Discord:
Configuring TTS
Edge TTS works out of the box with no configuration. To change the voice, update your config:
Microsoft Edge TTS supports a large number of voices across many languages. Voice names follow the pattern {language}-{region}-{name}Neural.
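The naming pattern can be checked with a short parser. This is a sketch that handles only the three-part {language}-{region}-{name}Neural form described above; the helper names are hypothetical, and voice names that carry extra segments would need a looser pattern.

```python
import re
from typing import NamedTuple, Optional


class Voice(NamedTuple):
    language: str
    region: str
    name: str


# Assumes a lowercase 2-3 letter language code and a 2-letter uppercase region.
VOICE_NAME = re.compile(r"^([a-z]{2,3})-([A-Z]{2})-([A-Za-z]+)Neural$")


def parse_voice(voice: str) -> Optional[Voice]:
    """Split a voice name into its parts, or return None if it does not match."""
    match = VOICE_NAME.match(voice)
    if match is None:
        return None
    return Voice(*match.groups())
```

With the default voice, `parse_voice("en-US-AriaNeural")` yields language `en`, region `US`, and name `Aria`.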
Enabling TTS in config
Set provider to null to disable TTS entirely.
Using both together
STT and TTS work independently. You can use either or both at the same time. A typical voice workflow:
1. Send a voice message in a session topic
2. OpenACP transcribes it via Groq STT
3. The transcription appears as "You said: ..."
4. The agent processes the text and responds
5. If TTS is on, the response summary is synthesized and sent as audio
