Config Structure
A voice pipeline config has the following structure. The plugins and clients fields are required; all other fields are optional.
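The sketch below illustrates the overall shape. It assumes the config is written as JSON and that each plugin entry is an object with a use field and an options object, as described under Plugins; all values are placeholders.

```jsonc
{
  // Required: which client transports are enabled.
  "clients": { "browser": true, "twilio": false },

  // Optional root-level settings (see the sections below).
  "metadata": { "app": "demo" },
  "session_duration_timeout_minutes": 30,

  // Required: ordered processing steps (see Plugins below for the full list).
  "plugins": [
    { "use": "stt.deepgram", "options": { "model_id": "flux" } },
    { "use": "turn_manager" },
    { "use": "agent.webhook", "options": { "url": "https://example.com/agent" } },
    { "use": "tts.rime", "options": { "model_id": "mistv2" } }
  ]
}
```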
Root-Level Options
clients
Required. Enable or disable specific client transports.
Configuration:
| Option | Type | Required | Default | Description |
|---|---|---|---|---|
| browser | boolean | No | true | Enable browser WebSocket connections. |
| twilio | boolean | No | false | Enable Twilio Media Streams connections. |
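Example (an illustrative sketch assuming JSON syntax; values are placeholders):

```jsonc
{
  "clients": {
    "browser": true,   // allow browser WebSocket connections
    "twilio": false    // leave Twilio Media Streams disabled
  }
}
```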
metadata
Custom key-value data attached to every session. This metadata is included in webhook payloads and can be used for tracking, analytics, or passing context to your agent.
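Example (an illustrative sketch assuming JSON syntax; the keys shown are arbitrary placeholders):

```jsonc
{
  "metadata": {
    "customer_id": "cus_123",   // hypothetical key
    "plan": "pro"               // hypothetical key
  }
}
```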
session_webhook
Configure webhooks for session lifecycle events. Useful for logging, analytics, or triggering external workflows when sessions start, end, or update.
Configuration:
| Option | Type | Required | Default | Description |
|---|---|---|---|---|
| url | string | Yes | - | Webhook endpoint URL. Must be HTTPS. |
| custom_headers | Record<string, string> | No | - | Additional headers to send with webhook requests. |
| custom_metadata | Record<string, any> | No | - | Extra metadata to include in webhook payloads. |
| events | array<"session.start" \| "session.end" \| "session.update"> | No | All events | Which events to send to the webhook. |
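Example (an illustrative sketch assuming JSON syntax; the URL and header values are placeholders):

```jsonc
{
  "session_webhook": {
    "url": "https://example.com/webhooks/layercode",           // must be HTTPS
    "custom_headers": { "Authorization": "Bearer YOUR_TOKEN" },
    "custom_metadata": { "environment": "production" },
    "events": ["session.start", "session.end"]                 // omit to receive all events
  }
}
```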
session_duration_timeout_minutes
Maximum session duration in minutes. Sessions automatically end after this timeout.
Configuration:
| Type | Required | Default | Min | Max |
|---|---|---|---|---|
| number | No | 30 | 1 | 1440 (24 hours) |
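Example (an illustrative sketch assuming JSON syntax):

```jsonc
{
  "session_duration_timeout_minutes": 60   // end sessions after one hour (default 30, max 1440)
}
```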
vad
Voice Activity Detection (VAD) configuration. VAD detects when users start and stop speaking, enabling natural turn-taking. It is enabled by default; in most cases you do not need to include the vad config or edit these advanced settings, but you can disable VAD or tune it here.
Configuration:
| Option | Type | Required | Default | Description |
|---|---|---|---|---|
| enabled | boolean | No | true | Enable voice activity detection. |
| gate_audio | boolean | No | true | Only send audio to STT when speech is detected. |
| buffer_frames | number | No | 10 | Number of audio frames to buffer (0-20). |
| model | "v5" | No | "v5" | VAD model version. |
| positive_speech_threshold | number | No | - | Confidence threshold for detecting speech (0-1). |
| negative_speech_threshold | number | No | - | Confidence threshold for detecting silence (0-1). |
| redemption_frames | number | No | - | Frames of silence before ending speech detection (0-10). |
| min_speech_frames | number | No | - | Minimum frames required to count as speech (0-10). |
| pre_speech_pad_frames | number | No | - | Frames to include before detected speech (0-10). |
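Example (an illustrative sketch assuming JSON syntax; the threshold values are placeholders, and omitted options fall back to their defaults):

```jsonc
{
  "vad": {
    "enabled": true,
    "gate_audio": true,                 // only forward audio to STT while speech is detected
    "buffer_frames": 10,
    "model": "v5",
    "positive_speech_threshold": 0.8,   // placeholder value
    "redemption_frames": 4              // placeholder value
  }
}
```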
Plugins
Plugins are the processing steps in your voice pipeline and must be specified in order. Each plugin is an object with a use field (the plugin type) and an optional options object.
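A sketch of the plugins array under that assumption (plugin names are taken from the sections below; the options shown are placeholders):

```jsonc
{
  "plugins": [
    { "use": "stt.deepgram", "options": { "model_id": "flux" } },
    { "use": "turn_manager", "options": { "base_timeout_ms": 2000 } },
    { "use": "agent.webhook", "options": { "url": "https://example.com/agent" } },
    { "use": "tts.rime", "options": { "model_id": "mistv2", "voice_id": "courtney" } }
  ]
}
```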
STT Plugins (Speech-to-Text)
Convert incoming audio to text transcripts. LayerCode supports two STT providers:
| Provider | Key Required | Models |
|---|---|---|
| Deepgram | No (managed) | Flux (English, ultra-low latency), Nova-3 (multilingual) |
| AssemblyAI | No (managed) | Universal Streaming (English or multilingual) |
stt.deepgram
Deepgram speech-to-text with Nova-3 or Flux models.
Configuration:
model_id: "flux"
| Option | Type | Required | Default | Description |
|---|---|---|---|---|
model_id | "flux" | Yes | - | Deepgram Flux STT model. |
language | English (en) | No | "en" | Language. Flux only supports English currently. |
keyterms | array<string> | No | - | Array of key terms to boost transcription accuracy for. |
model_id: "nova-3"
| Option | Type | Required | Default | Description |
|---|---|---|---|---|
model_id | "nova-3" | Yes | - | Deepgram Nova STT model. |
language | Multilingual (English, Spanish, French, German, Hindi, Russian, Portuguese, Japanese, Italian, and Dutch) (multi), Bulgarian (bg), Catalan (ca), Czech (cs), Danish (da), Danish (Denmark) (da-DK), Dutch (nl), English (en), English (US) (en-US), English (Australia) (en-AU), English (UK) (en-GB), English (India) (en-IN), English (New Zealand) (en-NZ), Estonian (et), Finnish (fi), Flemish (nl-BE), French (fr), French (Canada) (fr-CA), German (de), German (Switzerland) (de-CH), Greek (el), Hindi (hi), Hungarian (hu), Indonesian (id), Italian (it), Japanese (ja), Korean (ko), Korean (Korea) (ko-KR), Latvian (lv), Lithuanian (lt), Malay (ms), Norwegian (no), Polish (pl), Portuguese (pt), Portuguese (Brazil) (pt-BR), Portuguese (Portugal) (pt-PT), Romanian (ro), Russian (ru), Slovak (sk), Spanish (es), Spanish (Latin America) (es-419), Swedish (sv), Swedish (Sweden) (sv-SE), Turkish (tr), Ukrainian (uk), Vietnamese (vi) | No | "multi" | Language. |
keyterms | array<string> | No | - | Array of key terms to boost transcription accuracy for. |
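Example (an illustrative sketch assuming JSON syntax and the use/options plugin shape; the key terms are placeholders):

```jsonc
{
  "use": "stt.deepgram",
  "options": {
    "model_id": "nova-3",
    "language": "multi",
    "keyterms": ["LayerCode", "Deepgram"]   // placeholder key terms
  }
}
```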
stt.assemblyai
AssemblyAI Universal Streaming speech-to-text. Supports English and multilingual (English, Spanish, French, German, Italian, Portuguese). Managed by LayerCode; no API key required.
Configuration:
| Option | Type | Required | Default | Description |
|---|---|---|---|---|
| speech_model | "universal-streaming-english" \| "universal-streaming-multilingual" | No | "universal-streaming-english" | Speech model. Multilingual supports English, Spanish, French, German, Italian, Portuguese. |
| word_boost | array<string> | No | - | Array of custom vocabulary words to boost recognition accuracy. |
| end_of_turn_confidence_threshold | number (min: 0, max: 1) | No | 0.4 | Confidence threshold (0.0-1.0) for detecting end of turn. |
| min_end_of_turn_silence_when_confident | number (min: 0) | No | 400 | Minimum silence in milliseconds when confident about end of turn. |
| max_turn_silence | number (min: 0) | No | 1280 | Maximum silence in milliseconds before end of turn is triggered. |
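Example (an illustrative sketch assuming JSON syntax and the use/options plugin shape; the word_boost entries are placeholders and the timing values are the documented defaults):

```jsonc
{
  "use": "stt.assemblyai",
  "options": {
    "speech_model": "universal-streaming-multilingual",
    "word_boost": ["LayerCode"],                     // placeholder vocabulary
    "end_of_turn_confidence_threshold": 0.4,
    "min_end_of_turn_silence_when_confident": 400,
    "max_turn_silence": 1280
  }
}
```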
Turn Manager
Manages conversation turn-taking between user and assistant. Handles interruptions (barge-in) and determines when the user has finished speaking.
turn_manager
VAD-based turn management with configurable timeout.
Configuration:
| Option | Type | Required | Default | Description |
|---|---|---|---|---|
mode | "automatic" | No | "automatic" | Turn-taking mode. Only automatic (VAD-based interruption) is supported. |
base_timeout_ms | number (min: 500, max: 5000) | No | 2000 | Base VAD timeout in milliseconds (e.g., 500-5000). Required. |
user_silence_timeout_minutes | unknown | No | - | User silence timeout in minutes (e.g., 1-60). Null/undefined disables the timeout. |
disable_interruptions_during_welcome | boolean | No | false | Disable user interruptions during the first assistant response (welcome message). |
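Example (an illustrative sketch assuming JSON syntax and the use/options plugin shape; the silence timeout value is a placeholder):

```jsonc
{
  "use": "turn_manager",
  "options": {
    "mode": "automatic",
    "base_timeout_ms": 2000,
    "user_silence_timeout_minutes": 5,             // placeholder; null/omitted disables the timeout
    "disable_interruptions_during_welcome": true
  }
}
```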
Agent Plugins
Generate AI responses from user messages. Choose one based on your use case:
- agent.llm - Hosted LLM for simple conversational agents
- agent.webhook - Your own HTTPS endpoint for custom logic
- agent.ws - Your own WebSocket server for real-time bidirectional communication
agent.llm
Hosted LLM agent using Google Gemini or OpenAI models. Best for simple conversational agents without custom business logic.
Configuration:
| Option | Type | Required | Default | Description |
|---|---|---|---|---|
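Example (Google; an illustrative sketch assuming JSON syntax and the use/options plugin shape — the option keys shown are hypothetical placeholders):

```jsonc
{
  "use": "agent.llm",
  "options": {
    // Hypothetical option names, for illustration only.
    "model": "gemini-2.0-flash",
    "system_prompt": "You are a helpful voice assistant."
  }
}
```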
agent.webhook
Send user messages to your HTTPS endpoint and receive streaming responses. Best for integrating with existing backends or AI orchestration frameworks.
Configuration:
| Option | Type | Required | Default | Description |
|---|---|---|---|---|
| url | string | Yes | - | Webhook endpoint URL. |
| headers | Record<string, string> | No | - | HTTP headers to send with requests. |
| events | array<"message" \| "data" \| "session.start"> | No | ["message"] | Events to forward to the webhook. "message" is required; "session.start" and "data" are optional. |
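Example (an illustrative sketch assuming JSON syntax and the use/options plugin shape; the URL and header values are placeholders):

```jsonc
{
  "use": "agent.webhook",
  "options": {
    "url": "https://example.com/agent",
    "headers": { "Authorization": "Bearer YOUR_TOKEN" },
    "events": ["message", "session.start"]   // "message" must always be included
  }
}
```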
TTS Plugins (Text-to-Speech)
Convert agent text responses to audio. LayerCode supports the following TTS providers:
| Provider | Key Required | Best For |
|---|---|---|
| Inworld | No (managed) | High quality, low cost expressive voices |
| Rime | No (managed) | Expressive voices |
| Cartesia | Yes (BYOK) | Customers with a Cartesia account |
| ElevenLabs | Yes (BYOK) | Customers with an ElevenLabs account |
tts.rime
Rime TTS with ultra-low latency streaming. Managed by LayerCode; no API key required.
Configuration:
| Option | Type | Required | Default | Description |
|---|---|---|---|---|
model_id | "mistv2" | Yes | - | Rime TTS model. |
voice_id | string | No | "courtney" | Rime voice id. |
language | "eng", "spa" | No | "eng" | Language. |
tts.inworld
Inworld TTS for gaming and interactive characters with voice tuning controls. Requires your own Inworld API credentials.
Configuration:
| Option | Type | Required | Default | Description |
|---|---|---|---|---|
model_id | "inworld-tts-1" | "inworld-tts-1.5-max" | "inworld-tts-1.5-mini" | No | "inworld-tts-1" | Inworld TTS model. |
voice_id | string | No | "Clive" | Inworld voice id. |
voice_config | object | No | - | - |
voice_config options:
| Option | Type | Required | Default | Description |
|---|---|---|---|---|
| pitch | number (min: -10, max: 10) | No | 1 | Voice pitch adjustment. Range: -10 to 10. |
| speaking_rate | number (min: 0, max: 5) | No | 0 | Speaking rate/speed. Range: 0 to 5. |
| robotic_filter | number (min: 0, max: 5) | No | 0 | Robotic voice filter level. Range: 0 to 5. |
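Example (an illustrative sketch assuming JSON syntax and the use/options plugin shape; the voice_config values are the documented defaults):

```jsonc
{
  "use": "tts.inworld",
  "options": {
    "model_id": "inworld-tts-1",
    "voice_id": "Clive",
    "voice_config": {
      "pitch": 1,
      "speaking_rate": 0,
      "robotic_filter": 0
    }
  }
}
```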
tts.elevenlabs
ElevenLabs TTS with high-quality voices and extensive voice customization. Requires your own ElevenLabs API key.
Configuration:
| Option | Type | Required | Default | Description |
|---|---|---|---|---|
model_id | "eleven_v2_5_flash" | Yes | - | ElevenLabs TTS model. |
voice_id | string | Yes | - | ElevenLabs voice id. |
voice_settings | object | No | - | - |
language | English (en), Japanese (ja), Chinese (zh), German (de), Hindi (hi), French (fr), Korean (ko), Portuguese (pt), Italian (it), Spanish (es), Indonesian (id), Dutch (nl), Turkish (tr), Filipino (fil), Polish (pl), Swedish (sv), Bulgarian (bg), Romanian (ro), Arabic (ar), Czech (cs), Greek (el), Finnish (fi), Croatian (hr), Malay (ms), Slovak (sk), Danish (da), Tamil (ta), Ukrainian (uk), Russian (ru), Hungarian (hu), Norwegian (no), Vietnamese (vi) | No | "en" | Language. |
voice_settings options:
| Option | Type | Required | Default | Description |
|---|---|---|---|---|
| stability | number (min: 0, max: 1) | No | 0.5 | Defines the stability for voice settings. |
| similarity_boost | number (min: 0, max: 1) | No | 0.75 | Defines the similarity boost for voice settings. |
| style | number (min: 0, max: 1) | No | 0 | Defines the style for voice settings. Available on V2+ models. |
| use_speaker_boost | boolean | No | true | Defines the use speaker boost for voice settings. Available on V2+ models. |
| speed | number (min: 0.7, max: 1.2) | No | 1.0 | Controls the speed of the generated speech. Values range from 0.7 to 1.2. |
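Example (an illustrative sketch assuming JSON syntax and the use/options plugin shape; the voice id is a placeholder and the voice_settings values are the documented defaults):

```jsonc
{
  "use": "tts.elevenlabs",
  "options": {
    "model_id": "eleven_v2_5_flash",
    "voice_id": "YOUR_ELEVENLABS_VOICE_ID",   // placeholder voice id
    "language": "en",
    "voice_settings": {
      "stability": 0.5,
      "similarity_boost": 0.75,
      "speed": 1.0
    }
  }
}
```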
tts.cartesia
Cartesia Sonic TTS with emotion controls and word-level timestamps. Requires your own Cartesia API key.
Configuration:
model_id: "sonic-2"
| Option | Type | Required | Default | Description |
|---|---|---|---|---|
model_id | "sonic-2" | Yes | - | Cartesia Sonic 2 TTS model. |
voice_id | string | Yes | - | Cartesia voice id. |
language | English (en), French (fr), German (de), Spanish (es), Portuguese (pt), Chinese (zh), Japanese (ja), Hindi (hi), Italian (it), Korean (ko), Dutch (nl), Polish (pl), Russian (ru), Swedish (sv), Turkish (tr) | No | "en" | Language. |
model_id: "sonic-3"
| Option | Type | Required | Default | Description |
|---|---|---|---|---|
model_id | "sonic-3", "sonic-3-2025-10-27" | Yes | - | Cartesia Sonic 3 TTS model with expanded language support. |
voice_id | string | Yes | - | Cartesia voice id. |
voice_settings | object | No | - | - |
language | English (en), French (fr), German (de), Spanish (es), Portuguese (pt), Chinese (zh), Japanese (ja), Hindi (hi), Italian (it), Korean (ko), Dutch (nl), Polish (pl), Russian (ru), Swedish (sv), Turkish (tr), Tagalog (tl), Bulgarian (bg), Romanian (ro), Arabic (ar), Czech (cs), Greek (el), Finnish (fi), Croatian (hr), Malay (ms), Slovak (sk), Danish (da), Tamil (ta), Ukrainian (uk), Hungarian (hu), Norwegian (no), Vietnamese (vi), Bengali (bn), Thai (th), Hebrew (he), Georgian (ka), Indonesian (id), Telugu (te), Gujarati (gu), Kannada (kn), Malayalam (ml), Marathi (mr), Punjabi (pa) | No | "en" | Language. |
voice_settings options:
| Option | Type | Required | Default | Description |
|---|---|---|---|---|
| volume | number (min: 0.5, max: 2) | No | 1.0 | Adjusts the volume of the generated speech. Values range from 0.5 to 2.0. |
| speed | number (min: 0.6, max: 1.5) | No | 1.0 | Controls the speed of the generated speech. Values range from 0.6 to 1.5. |
| emotion | string | No | - | Controls the emotion of the generated speech. Primary emotions are neutral, calm, angry, content, sad, scared. See docs for more options. |
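Example (an illustrative sketch assuming JSON syntax and the use/options plugin shape; the voice id is a placeholder):

```jsonc
{
  "use": "tts.cartesia",
  "options": {
    "model_id": "sonic-3",
    "voice_id": "YOUR_CARTESIA_VOICE_ID",   // placeholder voice id
    "language": "en",
    "voice_settings": {
      "speed": 1.0,
      "emotion": "calm"
    }
  }
}
```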
Complete Examples
- Simple Hosted LLM
- Custom Webhook Agent
- Twilio Phone Integration
A minimal configuration using LayerCode’s hosted LLM agent:
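An illustrative sketch assuming JSON syntax and the use/options plugin shape; the agent.llm options are omitted, and the remaining values are placeholders or documented defaults:

```jsonc
{
  "clients": { "browser": true, "twilio": false },
  "plugins": [
    { "use": "stt.deepgram", "options": { "model_id": "flux" } },
    { "use": "turn_manager", "options": { "base_timeout_ms": 2000 } },
    { "use": "agent.llm" },                                         // hosted LLM agent
    { "use": "tts.rime", "options": { "model_id": "mistv2", "voice_id": "courtney" } }
  ]
}
```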
Audio Format
The pipeline automatically handles audio format conversion based on the client type:
| Client | Input Format | Output Format |
|---|---|---|
| Browser | PCM16 | PCM16 |
| Twilio | mulaw @ 8kHz | mulaw @ 8kHz |