Using the right system prompt is especially important when building Voice AI Agents. LLMs are primarily trained on written text, so they tend to produce output that is more formal and structured than natural speech. By carefully crafting your prompt, you can guide the model to generate responses that sound more conversational and human-like.

Base System Prompt for Voice AI

Minimal base prompt for Voice AI

You are a helpful conversation voice AI assistant.
You are having a spoken conversation.
Your responses will be read aloud by a text-to-speech system.
You should respond to the user's message in a conversational manner that matches spoken word. Punctuation should still always be included.
Never output markdown, emojis or special characters.
Use contractions naturally.

Pronunciation of numbers, dates & times

Pronunciation of numbers, dates, times, and special characters is also crucial for voice applications. TTS (text-to-speech) providers handle pronunciations in different ways. A good base prompt that guides the LLM to use words to spell out numbers, dates, addresses etc will work for common cases.

Numbers & data rules

Convert the output text into a format suitable for text-to-speech. Ensure that numbers, symbols, and abbreviations are expanded for clarity when read aloud. Expand all abbreviations to their full spoken forms.

Example input and output:

"$42.50" → "forty-two dollars and fifty cents"
"£1,001.32" → "one thousand and one pounds and thirty-two pence"
"1234" → "one thousand two hundred thirty-four"
"3.14" → "three point one four"
"555-555-5555" → "five five five, five five five, five five five five"
"2nd" → "second"
"XIV" → "fourteen" - unless it's a title, then it's "the fourteenth"
"3.5" → "three point five"
"⅔" → "two-thirds"
"Dr." → "Doctor"
"Ave." → "Avenue"
"St." → "Street" (but saints like "St. Patrick" should remain)
"Ctrl + Z" → "control z"
"100km" → "one hundred kilometers"
"100%" → "one hundred percent"
"elevenlabs.io/docs" → "eleven labs dot io slash docs"
"2024-01-01" → "January first, two-thousand twenty-four"
"123 Main St, Anytown, USA" → "one two three Main Street, Anytown, United States of America"
"14:30" → "two thirty PM"
"01/02/2023" → "January second, two-thousand twenty-three" or "the first of February, two-thousand twenty-three", depending on locale of the user

Keep long paragraphs sounding natural

Most text-to-speech systems will change prosody if they receive each sentence individually. If your voice agent needs to speak a large amount of text (e.g. a long legal disclosures or policy statements), follow this guidance to keep paragraphs sounding natural:

Send multiple sentences together. When you already have the full copy (for example, a static disclosure), pass the entire paragraph in a single stream.tts message so the speech engine can maintain the correct intonation.
Wait for the model to finish generating. Fast models such as Gemini 2.5 Flash Lite can produce a paragraph quickly. Instead of streaming each partial sentence as soon as it appears, collect the complete paragraph and then forward the whole string to the TTS provider.

This approach avoids sentence-by-sentence delivery that can make disclosures sound choppy.

Overview

SDKs

How-to guides

Explanations

How to write prompts for voice agents

Base System Prompt for Voice AI

Pronunciation of numbers, dates & times

Keep long paragraphs sounding natural

Overview

SDKs

How-to guides

Explanations

​Base System Prompt for Voice AI

​Pronunciation of numbers, dates & times

​Keep long paragraphs sounding natural

Base System Prompt for Voice AI

Pronunciation of numbers, dates & times

Keep long paragraphs sounding natural