A voice pipeline defines how audio flows through your voice application. It connects speech-to-text, your agent logic, and text-to-speech into a seamless real-time conversation. Each LayerCode agent has a config that defines how it works. The config is currently edited in the dashboard using the pipeline editor UI; in the near future, a custom config JSON will be settable on a per-session basis.

Config Structure

A voice pipeline config has the following structure:
{
  "clients": { ... },
  "metadata": { ... },
  "session_webhook": { ... },
  "session_duration_timeout_minutes": 30,
  "vad": { ... },
  "plugins": [ ... ]
}
plugins and clients are required. All other fields are optional.

Root-Level Options

clients

Required. Enable or disable specific client transports.

Configuration:

| Option | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| browser | boolean | No | true | Enable browser WebSocket connections. |
| twilio | boolean | No | false | Enable Twilio Media Streams connections. |
Example:
{
  "clients": {
    "browser": true,
    "twilio": true
  }
}

metadata

Custom key-value data attached to every session. This metadata is included in webhook payloads and can be used for tracking, analytics, or passing context to your agent.

Example:
{
  "metadata": {
    "environment": "production",
    "version": "1.2.0",
    "customer_tier": "enterprise"
  }
}

session_webhook

Configure webhooks for session lifecycle events. Useful for logging, analytics, or triggering external workflows when sessions start, end, or update.

Configuration:

| Option | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| url | string | Yes | - | Webhook endpoint URL. Must be HTTPS. |
| custom_headers | Record<string, string> | No | - | Additional headers to send with webhook requests. |
| custom_metadata | Record<string, any> | No | - | Extra metadata to include in webhook payloads. |
| events | array<"session.start" \| "session.end" \| "session.update"> | No | All events | Which events to send to the webhook. |
Example:
{
  "session_webhook": {
    "url": "https://your-server.com/webhooks/voice",
    "custom_headers": {
      "X-Custom-Header": "value"
    },
    "events": ["session.start", "session.end"]
  }
}

session_duration_timeout_minutes

Maximum session duration in minutes. Sessions automatically end after this timeout.

Configuration:

| Type | Required | Default | Min | Max |
| --- | --- | --- | --- | --- |
| number | No | 30 | 1 | 1440 (24 hours) |
Example:
{
  "session_duration_timeout_minutes": 60
}

vad

Voice Activity Detection (VAD) configuration. VAD detects when users start and stop speaking, enabling natural turn-taking. It is enabled by default, and in most cases you do not need to include the vad config at all; disable it or tune the advanced settings only when your use case requires it.

Configuration:

| Option | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| enabled | boolean | No | true | Enable voice activity detection. |
| gate_audio | boolean | No | true | Only send audio to STT when speech is detected. |
| buffer_frames | number | No | 10 | Number of audio frames to buffer (0-20). |
| model | "v5" | No | "v5" | VAD model version. |
| positive_speech_threshold | number | No | - | Confidence threshold for detecting speech (0-1). |
| negative_speech_threshold | number | No | - | Confidence threshold for detecting silence (0-1). |
| redemption_frames | number | No | - | Frames of silence before ending speech detection (0-10). |
| min_speech_frames | number | No | - | Minimum frames required to count as speech (0-10). |
| pre_speech_pad_frames | number | No | - | Frames to include before detected speech (0-10). |
Example:
{
  "vad": {
    "enabled": true,
    "gate_audio": true,
    "buffer_frames": 10
  }
}
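
To disable VAD entirely (the remaining options can then be omitted):
{
  "vad": {
    "enabled": false
  }
}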

Plugins

Plugins are the processing steps in your voice pipeline. They must be specified in order:
stt.* → turn_manager → agent.* → tts.*
Each plugin is configured with a use field (the plugin type) and an optional options object.
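
For example, a plugins array that follows this order has the shape below (options elided):
{
  "plugins": [
    { "use": "stt.deepgram", "options": { ... } },
    { "use": "turn_manager", "options": { ... } },
    { "use": "agent.llm", "options": { ... } },
    { "use": "tts.rime", "options": { ... } }
  ]
}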

STT Plugins (Speech-to-Text)

Convert incoming audio to text transcripts. LayerCode supports two STT providers:
| Provider | Key Required | Models |
| --- | --- | --- |
| Deepgram | No (managed) | Flux (English, ultra-low latency), Nova-3 (multilingual) |
| AssemblyAI | No (managed) | Universal Streaming (English or multilingual) |

Both providers are managed by LayerCode, so no API keys are required.

stt.deepgram

Deepgram speech-to-text with Nova-3 or Flux models.

Configuration:

model_id: "flux"

| Option | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| model_id | "flux" | Yes | - | Deepgram Flux STT model. |
| language | English (en) | No | "en" | Language. Flux currently supports English only. |
| keyterms | array<string> | No | - | Key terms to boost transcription accuracy for. |

model_id: "nova-3"

| Option | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| model_id | "nova-3" | Yes | - | Deepgram Nova-3 STT model. |
| language | Multilingual (English, Spanish, French, German, Hindi, Russian, Portuguese, Japanese, Italian, and Dutch) (multi), Bulgarian (bg), Catalan (ca), Czech (cs), Danish (da), Danish (Denmark) (da-DK), Dutch (nl), English (en), English (US) (en-US), English (Australia) (en-AU), English (UK) (en-GB), English (India) (en-IN), English (New Zealand) (en-NZ), Estonian (et), Finnish (fi), Flemish (nl-BE), French (fr), French (Canada) (fr-CA), German (de), German (Switzerland) (de-CH), Greek (el), Hindi (hi), Hungarian (hu), Indonesian (id), Italian (it), Japanese (ja), Korean (ko), Korean (Korea) (ko-KR), Latvian (lv), Lithuanian (lt), Malay (ms), Norwegian (no), Polish (pl), Portuguese (pt), Portuguese (Brazil) (pt-BR), Portuguese (Portugal) (pt-PT), Romanian (ro), Russian (ru), Slovak (sk), Spanish (es), Spanish (Latin America) (es-419), Swedish (sv), Swedish (Sweden) (sv-SE), Turkish (tr), Ukrainian (uk), Vietnamese (vi) | No | "multi" | Language. |
| keyterms | array<string> | No | - | Key terms to boost transcription accuracy for. |
Example:
{
  "use": "stt.deepgram",
  "options": {
    "model_id": "nova-3",
    "language": "en-US",
    "keyterms": ["LayerCode", "Realpipe"]
  }
}
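
A Flux configuration for ultra-low-latency English transcription, using only the options documented above:
{
  "use": "stt.deepgram",
  "options": {
    "model_id": "flux",
    "language": "en",
    "keyterms": ["LayerCode"]
  }
}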

stt.assemblyai

AssemblyAI Universal Streaming speech-to-text. Supports English and multilingual (English, Spanish, French, German, Italian, Portuguese). Managed by LayerCode; no API key required.

Configuration:
| Option | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| speech_model | "universal-streaming-english" \| "universal-streaming-multilingual" | No | "universal-streaming-english" | Speech model. Multilingual supports English, Spanish, French, German, Italian, Portuguese. |
| word_boost | array<string> | No | - | Custom vocabulary words to boost recognition accuracy. |
| end_of_turn_confidence_threshold | number (min: 0, max: 1) | No | 0.4 | Confidence threshold (0.0-1.0) for detecting end of turn. |
| min_end_of_turn_silence_when_confident | number (min: 0) | No | 400 | Minimum silence in milliseconds when confident about end of turn. |
| max_turn_silence | number (min: 0) | No | 1280 | Maximum silence in milliseconds before end of turn is triggered. |
Example:
{
  "use": "stt.assemblyai",
  "options": {
    "speech_model": "universal-streaming-english",
    "word_boost": ["LayerCode", "Realpipe"]
  }
}
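
A multilingual configuration with tuned end-of-turn detection (the threshold and silence values here are illustrative, not recommendations):
{
  "use": "stt.assemblyai",
  "options": {
    "speech_model": "universal-streaming-multilingual",
    "end_of_turn_confidence_threshold": 0.5,
    "min_end_of_turn_silence_when_confident": 500
  }
}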

Turn Manager

Manages conversation turn-taking between user and assistant. Handles interruptions (barge-in) and determines when the user has finished speaking.
VAD-based turn management with a configurable timeout.

Configuration:
| Option | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| mode | "automatic" | No | "automatic" | Turn-taking mode. Only automatic (VAD-based interruption) is supported. |
| base_timeout_ms | number (min: 500, max: 5000) | No | 2000 | Base VAD timeout in milliseconds (e.g., 500-5000). |
| user_silence_timeout_minutes | number | No | - | User silence timeout in minutes (e.g., 1-60). Null/undefined disables the timeout. |
| disable_interruptions_during_welcome | boolean | No | false | Disable user interruptions during the first assistant response (welcome message). |
Example:
{
  "use": "turn_manager",
  "options": {
    "base_timeout_ms": 2000,
    "disable_interruptions_during_welcome": true
  }
}

Agent Plugins

Generate AI responses from user messages. Choose one based on your use case:
  • agent.llm - Hosted LLM for simple conversational agents
  • agent.webhook - Your own HTTPS endpoint for custom logic
  • agent.ws - Your own WebSocket server for real-time bidirectional communication

agent.llm

Hosted LLM agent using Google Gemini or OpenAI models. Best for simple conversational agents without custom business logic.

Configuration: the options are provider ("google" or "openai"), model_id, system_prompt, and welcome_message, as shown in the examples below.
Example (Google):
{
  "use": "agent.llm",
  "options": {
    "provider": "google",
    "model_id": "gemini-2.5-flash-lite",
    "system_prompt": "You are a helpful customer service agent for Acme Corp.",
    "welcome_message": "Hi! Welcome to Acme Corp. How can I help you today?"
  }
}
Example (OpenAI):
{
  "use": "agent.llm",
  "options": {
    "provider": "openai",
    "model_id": "gpt-4o-mini",
    "system_prompt": "You are a friendly assistant.",
    "welcome_message": "Hello! What can I help you with?"
  }
}

agent.webhook

Send user messages to your HTTPS endpoint and receive streaming responses. Best for integrating with existing backends or AI orchestration frameworks.

Configuration:

| Option | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| url | string | Yes | - | Webhook endpoint URL. |
| headers | Record<string, string> | No | - | HTTP headers to send with requests. |
| events | array<"message" \| "data" \| "session.start"> | No | ["message"] | Events to forward to the webhook. "message" is required; "session.start" and "data" are optional. |
Example:
{
  "use": "agent.webhook",
  "options": {
    "url": "https://your-agent.example.com/voice",
    "headers": {
      "Authorization": "Bearer your-token"
    },
    "events": ["message", "session.start"]
  }
}

TTS Plugins (Text-to-Speech)

Convert agent text responses to audio. LayerCode supports four TTS providers:

| Provider | Key Required | Best For |
| --- | --- | --- |
| Inworld | No (managed) | High-quality, low-cost expressive voices |
| Rime | No (managed) | Expressive voices |
| Cartesia | Yes (BYOK) | Customers with a Cartesia account |
| ElevenLabs | Yes (BYOK) | Customers with an ElevenLabs account |

Inworld or Rime is the easiest way to get started: LayerCode manages the credentials, so these providers work immediately. For Cartesia or ElevenLabs, add your API key in Settings → Providers.

tts.rime

Rime TTS with ultra-low-latency streaming. Managed by LayerCode; no API key required.

Configuration:

| Option | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| model_id | "mistv2" | Yes | - | Rime TTS model. |
| voice_id | string | No | "courtney" | Rime voice id. |
| language | "eng", "spa" | No | "eng" | Language. |
Example:
{
  "use": "tts.rime",
  "options": {
    "model_id": "mistv2",
    "voice_id": "courtney"
  }
}

tts.inworld

Inworld TTS for gaming and interactive characters with voice tuning controls. Managed by LayerCode; no API key required.

Configuration:
| Option | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| model_id | "inworld-tts-1" \| "inworld-tts-1.5-max" \| "inworld-tts-1.5-mini" | No | "inworld-tts-1" | Inworld TTS model. |
| voice_id | string | No | "Clive" | Inworld voice id. |
| voice_config | object | No | - | Voice tuning options (see below). |

voice_config options:

| Option | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| pitch | number (min: -10, max: 10) | No | 1 | Voice pitch adjustment. |
| speaking_rate | number (min: 0, max: 5) | No | 0 | Speaking rate/speed. |
| robotic_filter | number (min: 0, max: 5) | No | 0 | Robotic voice filter level. |
Example:
{
  "use": "tts.inworld",
  "options": {
    "model_id": "inworld-tts-1.5-max",
    "voice_id": "Clive",
    "voice_config": {
      "pitch": 1,
      "speaking_rate": 0,
      "robotic_filter": 0
    }
  }
}

tts.elevenlabs

ElevenLabs TTS with high-quality voices and extensive voice customization. Requires your own ElevenLabs API key.

Configuration:
| Option | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| model_id | "eleven_v2_5_flash" | Yes | - | ElevenLabs TTS model. |
| voice_id | string | Yes | - | ElevenLabs voice id. |
| voice_settings | object | No | - | Voice tuning options (see below). |
| language | English (en), Japanese (ja), Chinese (zh), German (de), Hindi (hi), French (fr), Korean (ko), Portuguese (pt), Italian (it), Spanish (es), Indonesian (id), Dutch (nl), Turkish (tr), Filipino (fil), Polish (pl), Swedish (sv), Bulgarian (bg), Romanian (ro), Arabic (ar), Czech (cs), Greek (el), Finnish (fi), Croatian (hr), Malay (ms), Slovak (sk), Danish (da), Tamil (ta), Ukrainian (uk), Russian (ru), Hungarian (hu), Norwegian (no), Vietnamese (vi) | No | "en" | Language. |

voice_settings options:

| Option | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| stability | number (min: 0, max: 1) | No | 0.5 | Stability for voice settings. |
| similarity_boost | number (min: 0, max: 1) | No | 0.75 | Similarity boost for voice settings. |
| style | number (min: 0, max: 1) | No | 0 | Style for voice settings. Available on V2+ models. |
| use_speaker_boost | boolean | No | true | Speaker boost for voice settings. Available on V2+ models. |
| speed | number (min: 0.7, max: 1.2) | No | 1.0 | Speed of the generated speech. |
Example:
{
  "use": "tts.elevenlabs",
  "options": {
    "model_id": "eleven_v2_5_flash",
    "voice_id": "EiNlNiXeDU1pqqOPrYMO",
    "voice_settings": {
      "stability": 0.5,
      "speed": 1.0
    }
  }
}

tts.cartesia

Cartesia Sonic TTS with emotion controls and word-level timestamps. Requires your own Cartesia API key.

Configuration:

model_id: "sonic-2"

| Option | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| model_id | "sonic-2" | Yes | - | Cartesia Sonic 2 TTS model. |
| voice_id | string | Yes | - | Cartesia voice id. |
| language | English (en), French (fr), German (de), Spanish (es), Portuguese (pt), Chinese (zh), Japanese (ja), Hindi (hi), Italian (it), Korean (ko), Dutch (nl), Polish (pl), Russian (ru), Swedish (sv), Turkish (tr) | No | "en" | Language. |

model_id: "sonic-3"

| Option | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| model_id | "sonic-3", "sonic-3-2025-10-27" | Yes | - | Cartesia Sonic 3 TTS model with expanded language support. |
| voice_id | string | Yes | - | Cartesia voice id. |
| voice_settings | object | No | - | Voice tuning options (see below). |
| language | English (en), French (fr), German (de), Spanish (es), Portuguese (pt), Chinese (zh), Japanese (ja), Hindi (hi), Italian (it), Korean (ko), Dutch (nl), Polish (pl), Russian (ru), Swedish (sv), Turkish (tr), Tagalog (tl), Bulgarian (bg), Romanian (ro), Arabic (ar), Czech (cs), Greek (el), Finnish (fi), Croatian (hr), Malay (ms), Slovak (sk), Danish (da), Tamil (ta), Ukrainian (uk), Hungarian (hu), Norwegian (no), Vietnamese (vi), Bengali (bn), Thai (th), Hebrew (he), Georgian (ka), Indonesian (id), Telugu (te), Gujarati (gu), Kannada (kn), Malayalam (ml), Marathi (mr), Punjabi (pa) | No | "en" | Language. |

voice_settings options:

| Option | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| volume | number (min: 0.5, max: 2) | No | 1.0 | Volume of the generated speech. |
| speed | number (min: 0.6, max: 1.5) | No | 1.0 | Speed of the generated speech. |
| emotion | string | No | - | Emotion of the generated speech. Primary emotions are neutral, calm, angry, content, sad, scared. See docs for more options. |
Example:
{
  "use": "tts.cartesia",
  "options": {
    "model_id": "sonic-3",
    "voice_id": "your-voice-id",
    "voice_settings": {
      "speed": 1.0,
      "emotion": "neutral"
    }
  }
}
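
A Sonic 2 configuration, which takes a language option instead of voice_settings (the voice id is a placeholder):
{
  "use": "tts.cartesia",
  "options": {
    "model_id": "sonic-2",
    "voice_id": "your-voice-id",
    "language": "fr"
  }
}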

Complete Examples

A minimal configuration using LayerCode’s hosted LLM agent:
{
  "plugins": [
    {
      "use": "stt.deepgram",
      "options": { "model_id": "nova-3", "language": "en-US" }
    },
    {
      "use": "turn_manager",
      "options": { "base_timeout_ms": 2000 }
    },
    {
      "use": "agent.llm",
      "options": {
        "provider": "google",
        "model_id": "gemini-2.5-flash-lite",
        "system_prompt": "You are a helpful assistant.",
        "welcome_message": "Hi! How can I help you today?"
      }
    },
    { "use": "sentence_buffer" },
    {
      "use": "tts.rime",
      "options": { "model_id": "mistv2", "voice_id": "courtney" }
    }
  ]
}
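
A fuller sketch for a Twilio phone agent that forwards messages to your own backend, combining options documented above (the URLs, header value, and metadata are placeholders):
{
  "clients": { "browser": false, "twilio": true },
  "metadata": { "environment": "production" },
  "session_webhook": {
    "url": "https://your-server.com/webhooks/voice",
    "events": ["session.start", "session.end"]
  },
  "session_duration_timeout_minutes": 60,
  "plugins": [
    {
      "use": "stt.deepgram",
      "options": { "model_id": "flux" }
    },
    {
      "use": "turn_manager",
      "options": { "base_timeout_ms": 2000, "disable_interruptions_during_welcome": true }
    },
    {
      "use": "agent.webhook",
      "options": {
        "url": "https://your-agent.example.com/voice",
        "headers": { "Authorization": "Bearer your-token" }
      }
    },
    { "use": "sentence_buffer" },
    {
      "use": "tts.inworld",
      "options": { "model_id": "inworld-tts-1", "voice_id": "Clive" }
    }
  ]
}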

Audio Format

The pipeline automatically handles audio format conversion based on the client type:
| Client | Input Format | Output Format |
| --- | --- | --- |
| Browser | PCM16 | PCM16 |
| Twilio | mulaw @ 8kHz | mulaw @ 8kHz |
You don’t need to configure audio formats manually; the pipeline negotiates the correct format with each plugin automatically.