Reducing latency, and especially time-to-first-token, is essential for natural-sounding conversations. Some sources of latency are ones we will always work hard to reduce (e.g. transporting your audio across the internet), but others come down to choices and trade-offs you make, and a few techniques don't reduce latency directly but do reduce the feeling of it. All of this matters even more when you make tool calls or let agents run in loops, since those can take a long time to complete. Here are some tips to help you reduce latency (real or perceived) with your voice agents:
  1. Pick a low-TTFT model. We currently recommend Gemini 2.5 Flash-Lite or OpenAI's gpt-4o-mini because they deliver the quickest time-to-first-token. Avoid "thinking" or reasoning-extended variants unless you explicitly need them; they trade large amounts of latency for marginal quality gains in spoken conversations. (A sketch for measuring TTFT follows this list.)
  2. Prime the user with speech before long work. Inside a tool call, send a response.tts event such as "Let me look that up for you" before you start heavy processing. The SDK will surface it to the client as audio immediately, buying you time without leaving silence. See the tool calling how-to and the second sketch after this list for an example.
  3. Keep users informed during long tool calls. Emit a response.data message as soon as the work starts so the UI can surface a loader or status update; see Sending data to the client, the API reference for Data and state updates, and the third sketch after this list. You can also play a short "thinking" audio clip in the browser so the user hears that the agent is still busy.
  4. Be deliberate with RAG. Running retrieval on every turn (especially in loops) adds network hops and can stall a conversation. Fetch external data through tool calls only when it's needed (sketched below), and narrate what the agent is doing so the user understands the delay.
  5. Reduce infrastructure round trips. Store conversations in a fast, nearby database (Redis is a good default) and keep ancillary services in the same region as your Layercode deployment to avoid cross-region latency spikes. A minimal Redis sketch closes out the examples below.
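To make "low TTFT" concrete for tip 1, here is a minimal sketch for measuring time-to-first-token of a candidate model. It assumes the official openai Node SDK and an OPENAI_API_KEY in the environment; swap in whichever models you are comparing.

```ts
import OpenAI from "openai";

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function measureTTFT(model: string): Promise<number> {
  const start = Date.now();
  const stream = await client.chat.completions.create({
    model,
    stream: true,
    messages: [{ role: "user", content: "Say hello in one short sentence." }],
  });
  for await (const chunk of stream) {
    // The first chunk carrying content marks time-to-first-token.
    if (chunk.choices[0]?.delta?.content) {
      return Date.now() - start;
    }
  }
  return Date.now() - start; // stream ended without content
}

console.log(`TTFT: ${await measureTTFT("gpt-4o-mini")}ms`);
```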
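Next, a minimal sketch of the priming pattern from tip 2, assuming the streamResponse helper from @layercode/node-server-sdk (the tool calling how-to has the exact handler shape); lookupOrder is a hypothetical stand-in for any slow tool.

```ts
import { streamResponse } from "@layercode/node-server-sdk";

// Hypothetical slow tool, standing in for any multi-second lookup.
async function lookupOrder(orderId: string): Promise<string> {
  await new Promise((resolve) => setTimeout(resolve, 3000));
  return `Order ${orderId} ships tomorrow.`;
}

export const POST = async (request: Request) => {
  const requestBody = await request.json();
  return streamResponse(requestBody, async ({ stream }) => {
    // Speak immediately so the user never sits in silence...
    stream.tts("Let me look that up for you.");
    // ...then do the slow work and speak the real answer.
    const result = await lookupOrder("A1234");
    stream.tts(result);
    stream.end();
  });
};
```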
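Tip 3's status updates look similar, under the same streamResponse assumption. The payload shape and the hypothetical slowSearch tool are illustrative; how the client subscribes to these messages is covered in Sending data to the client.

```ts
import { streamResponse } from "@layercode/node-server-sdk";

// Hypothetical long-running tool.
async function slowSearch(query: string): Promise<string> {
  await new Promise((resolve) => setTimeout(resolve, 5000));
  return `Here is what I found about ${query}.`;
}

export const POST = async (request: Request) => {
  const requestBody = await request.json();
  const userQuery = requestBody.text; // assumes the webhook body carries the user's turn as `text`
  return streamResponse(requestBody, async ({ stream }) => {
    // Tell the UI work has started so it can show a loader.
    stream.data({ status: "searching", query: userQuery });
    stream.tts("Give me a second while I check.");

    const answer = await slowSearch(userQuery);

    stream.data({ status: "done" });
    stream.tts(answer);
    stream.end();
  });
};
```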
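For tip 4, one way to keep retrieval off the hot path is to expose it as a tool the model invokes only when it needs external facts, rather than running it on every turn. This sketch uses the OpenAI chat completions tools API; search_docs and its implementation are hypothetical.

```ts
import OpenAI from "openai";

const client = new OpenAI();

// Hypothetical retrieval function; it only runs when the model asks for it.
async function searchDocs(query: string): Promise<string> {
  return `Top result for "${query}" ...`;
}

const completion = await client.chat.completions.create({
  model: "gpt-4o-mini",
  messages: [{ role: "user", content: "What is our refund policy?" }],
  tools: [
    {
      type: "function",
      function: {
        name: "search_docs",
        description:
          "Search the knowledge base. Use only when the answer needs company-specific facts.",
        parameters: {
          type: "object",
          properties: { query: { type: "string" } },
          required: ["query"],
        },
      },
    },
  ],
});

const toolCall = completion.choices[0].message.tool_calls?.[0];
if (toolCall) {
  // Retrieval happens only on this turn, not on every turn.
  const { query } = JSON.parse(toolCall.function.arguments);
  console.log(await searchDocs(query));
}
```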
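Finally, a minimal sketch of tip 5's conversation store using ioredis. The key layout here is an assumption for illustration, not a Layercode convention.

```ts
import Redis from "ioredis";

// Point this at a Redis instance in the same region as your Layercode deployment.
const redis = new Redis(process.env.REDIS_URL!);

// Assumed key layout: one list of JSON-encoded turns per conversation.
const key = (conversationId: string) => `conv:${conversationId}:turns`;

export async function appendTurn(conversationId: string, role: string, text: string) {
  await redis.rpush(key(conversationId), JSON.stringify({ role, text, at: Date.now() }));
}

export async function loadHistory(conversationId: string) {
  const raw = await redis.lrange(key(conversationId), 0, -1);
  return raw.map((entry) => JSON.parse(entry));
}
```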