- Pick a low-TTFT model. We currently recommend Gemini 2.5 Flash-Lite or OpenAI's gpt-4o-mini because they deliver the quickest time-to-first-token (a simple way to measure this yourself is sketched after this list). Avoid “thinking” or reasoning-extended variants unless you explicitly need them, since they trade large amounts of latency for marginal quality gains in spoken conversations.
- Prime the user with speech before long work. Inside a tool call, send a `response.tts` event such as “Let me look that up for you” before you start heavy processing. The SDK will surface it to the client as audio immediately, buying you time without leaving silence. See the tool calling how-to for an example, and the first webhook sketch after this list.
- Keep users informed during long tool calls. Emit a `response.data` message as soon as the work starts so the UI can surface a loader or status update; see Sending data to the client, the API reference for Data and state updates, and the second webhook sketch below. You can also play a short “thinking” audio clip in the browser so the user hears that the agent is still busy.
- Be deliberate with RAG. Running retrieval on every turn (especially in loops) adds network hops and can stall a conversation. Fetch external data through tool calls only when it’s needed (one way to gate retrieval behind a tool is sketched below), and narrate what the agent is doing so the user understands the delay.
- Reduce infrastructure round trips. Store conversations in a fast, nearby database (Redis is a good default; see the final sketch below) and keep ancillary services in the same region as your Layercode deployment to avoid cross-region latency spikes.
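
If you want to compare candidate models yourself, a rough time-to-first-token measurement using the OpenAI Node SDK's streaming API might look like the sketch below. The prompt and model name are placeholders; measure from the region where your agent backend actually runs.

```typescript
// Rough TTFT check for a candidate model. Assumes the official "openai" npm
// package and an OPENAI_API_KEY in the environment.
import OpenAI from "openai";

const client = new OpenAI();

async function measureTtft(model: string): Promise<number> {
  const start = performance.now();
  const stream = await client.chat.completions.create({
    model,
    messages: [{ role: "user", content: "Say hello in one short sentence." }],
    stream: true,
  });
  for await (const chunk of stream) {
    // The first chunk carrying content marks time-to-first-token.
    if (chunk.choices[0]?.delta?.content) return performance.now() - start;
  }
  return performance.now() - start;
}

console.log(`TTFT: ${(await measureTtft("gpt-4o-mini")).toFixed(0)} ms`);
```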
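
To make the “prime with speech” pattern concrete, here is a minimal webhook sketch that speaks before the slow work starts. It assumes the `@layercode/node-server-sdk` package exposes a `streamResponse` helper with `stream.tts` and `stream.end` methods, and `lookupOrder` is a hypothetical slow tool call; check the tool calling how-to for the exact API.

```typescript
// Sketch: speak immediately, then do the slow work, so the caller never hears silence.
// Assumes streamResponse / stream.tts / stream.end from @layercode/node-server-sdk.
import { streamResponse } from "@layercode/node-server-sdk";

export const POST = async (request: Request) => {
  const requestBody = await request.json();

  return streamResponse(requestBody, async ({ stream }) => {
    // Surfaced to the client as audio right away.
    stream.tts("Let me look that up for you.");

    // Heavy processing happens after the user has already heard something.
    const order = await lookupOrder(requestBody.text);
    stream.tts(`Found it. Your order is ${order.status}.`);
    stream.end();
  });
};

// Hypothetical slow lookup standing in for your real tool call.
async function lookupOrder(_query: string): Promise<{ status: string }> {
  await new Promise((resolve) => setTimeout(resolve, 3000)); // simulate latency
  return { status: "out for delivery" };
}
```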
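
In the same way, a hedged sketch of pushing a `response.data` status update at the start of a long tool call, again assuming the `streamResponse` helper, this time with a `stream.data` method; the payload fields are illustrative, not a fixed schema.

```typescript
// Sketch: send a status payload the UI can render as a loader, then the result.
// Assumes streamResponse / stream.data / stream.tts from @layercode/node-server-sdk.
import { streamResponse } from "@layercode/node-server-sdk";

export const POST = async (request: Request) => {
  const requestBody = await request.json();

  return streamResponse(requestBody, async ({ stream }) => {
    // The client can show a "Searching bookings..." loader from this payload.
    stream.data({ status: "started", step: "searching_bookings" });

    const results = await searchBookings(requestBody.text);

    stream.data({ status: "done", resultCount: results.length });
    stream.tts(`I found ${results.length} matching bookings.`);
    stream.end();
  });
};

// Hypothetical slow retrieval standing in for your real backend call.
async function searchBookings(_query: string): Promise<string[]> {
  await new Promise((resolve) => setTimeout(resolve, 2500)); // simulate latency
  return ["Booking #1234"];
}
```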
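
For the RAG point, one way to keep retrieval off the hot path is to expose it as a tool the model calls only when it actually needs outside data. A sketch with the OpenAI Node SDK, where the `search_docs` tool name and the retrieval helper are illustrative:

```typescript
// Sketch: gate retrieval behind a tool call instead of running it on every turn.
// Assumes the official "openai" npm package; searchDocs is a hypothetical helper.
import OpenAI from "openai";

const client = new OpenAI();

const tools = [
  {
    type: "function" as const,
    function: {
      name: "search_docs",
      description:
        "Look up product documentation. Use only when the answer is not already known.",
      parameters: {
        type: "object",
        properties: { query: { type: "string" } },
        required: ["query"],
      },
    },
  },
];

const completion = await client.chat.completions.create({
  model: "gpt-4o-mini",
  messages: [{ role: "user", content: "What plans do you offer?" }],
  tools,
});

const toolCall = completion.choices[0].message.tool_calls?.[0];
if (toolCall) {
  // Retrieval only happens here, on demand; narrate the delay to the user while it runs.
  const { query } = JSON.parse(toolCall.function.arguments);
  console.log(`Model requested retrieval for: ${query}`);
  // const results = await searchDocs(query); // hypothetical retrieval call
}
```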
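
Finally, a small sketch of keeping conversation history in a Redis instance deployed next to your webhook, using the `redis` npm client; the key naming and turn shape are illustrative.

```typescript
// Sketch: store conversation turns in a nearby Redis list so each webhook turn
// avoids a cross-region database round trip. Assumes the official "redis" package.
import { createClient } from "redis";

const redis = createClient({ url: process.env.REDIS_URL });
await redis.connect();

type Turn = { role: "user" | "assistant"; content: string };

// Append one turn to the conversation's list.
async function appendTurn(conversationId: string, turn: Turn): Promise<void> {
  await redis.rPush(`conversation:${conversationId}`, JSON.stringify(turn));
}

// Load the full history to build the next LLM prompt.
async function loadHistory(conversationId: string): Promise<Turn[]> {
  const raw = await redis.lRange(`conversation:${conversationId}`, 0, -1);
  return raw.map((item) => JSON.parse(item) as Turn);
}
```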